LWW Assessment
Introduction
This report provides a unified overview of the resources,
environments, and facilities of the Library
Without Walls (LWW). This information is intended to
serve the future design of appropriate methods for
integrating software developments from the Active
Recommendation Systems for a Library Without Walls project (hereafter
the Active Recommendation Project, or ARP) into the LWW.
Below we consider the following issues:
- The database engines and formats currently in use by the LWW, and those
  anticipated in the near future.
- The components of the LWW "meta-base", including their content, structure,
  and engineering.
- Current and anticipated linkages and crossover amongst these components,
  including content, physical records, and document references.
- Engineering issues of read and write access for the ARP.
There are a few items which are incomplete in this report, in particular
a detailed consideration of the content crossover among the various LWW
corpi. Also, we rely on information provided externally on the LWW network
architecture and server configurations.
Database Formats
It will be useful first to identify and describe the various database environments
used within the LWW.
Advance
Advance is a database system maintained and enhanced by the LWW based on
legacy systems used to support library cataloging with MARC format records.
It is based on PIC, with some C code. PIC is a legacy hybrid OS-database
environment which is currently supported in an emulation mode within the
Universe product running on UNIX. In this context, it appears as a flat-file
system with relational capabilities. An SQL interface is available within
Universe. Implementation is in terms of a great many separate UNIX executables
with complex interactions. Advance is optimized for variable-length text
storage, and provides direct (machine-level) access to data records. Data
typing is partial and dynamic, based on text fields. Numeric and boolean
types are supported contextually based on available operations.
Within Advance, documents are identified by DocIDs (see below).
The Advance architecture is as follows. HTML form information is passed
through UNIX pipes to the WebZ gateway, which then parses it and constructs
a Z39.50 query. This is then sent in Z39.50 format to Advance, and in particular
to the Z39.50 server which is a portion of Advance. Other portions of Advance
then parse and execute the search, return DocIDs, format MARC records,
retrieve URLs, etc. This information is then passed back to the Z39.50
server, sent in Z39.50 format back to WebZ, which then formats an HTML
response.
Each different Advance corpus has its own set of indexed fields. Some
(e.g. title) are common to all corpi, others (e.g. taxonomic classification
in Biosis) are specialized to particular corpi. Some (e.g. subject) differ
semantically among corpi, but can be identified with each other. To represent
this, 22 common index "slots" are available, with each special index term
mapped into a particular slot. The following table shows this mapping.
| Index Slot | Main OPAC | BIOSIS | DOE Energy | Eng Index | INSPEC |
|---|---|---|---|---|---|
| A | Author | Author | Author | Author | Author |
| B | Organization | Assignee | Organization | Sponsor | Sponsor |
| D | Conference | Conference | Conference | Conference | Conference |
| E | TOC author | Institution | Affiliation | Affiliation | Institution |
| F | TOC title | Journal | | Journal | Journal |
| G | Date | Date | Date | Date | Date |
| H | Publisher | Publisher | Publisher | Publisher | Publisher |
| I | Format | Doc Type | Doc Type | Doc Type | Doc Type |
| J | Genre | | Material | | Affiliation |
| K | TOC general | | | | |
| L | Language | Language | Language | Language | Language |
| N | Note | Note | Note | | |
| O | Abstract | Abstract | Abstract | Abstract | Abstract |
| P | Series | Sequence data | | | Assignee |
| Q | Journal | Issue | Journal | Issue | Issue |
| R | Url | | | | Classification |
| S | Subject | Subject | Patent | | Subject |
| T | Title | Title | Title | Title | Title |
| U | | Taxonomy | Major Subject | Subject | FreeTerm |
| V | | | Minor Subject | Descriptor | Numeric data |
| W | Subject | Concept | | Classification | Chemical |
| Y | Subject | Other | | FreeTerm | Astronomy |
Advance Index Slots
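The slot table can be read as a simple lookup structure. The following is a minimal sketch, not Advance code: it copies only four rows of the table, and the corpus labels (`OPAC`, `BIOSIS`, `DOE`, `EngIndex`, `INSPEC`) and the function name are illustrative assumptions.

```python
# Partial copy of the slot table: slot letter -> {corpus: index name}.
# Only four rows are reproduced; empty cells are simply omitted.
INDEX_SLOTS = {
    "A": {"OPAC": "Author", "BIOSIS": "Author", "DOE": "Author",
          "EngIndex": "Author", "INSPEC": "Author"},
    "S": {"OPAC": "Subject", "BIOSIS": "Subject", "DOE": "Patent",
          "INSPEC": "Subject"},
    "U": {"BIOSIS": "Taxonomy", "DOE": "Major Subject",
          "EngIndex": "Subject", "INSPEC": "FreeTerm"},
    "W": {"OPAC": "Subject", "BIOSIS": "Concept",
          "EngIndex": "Classification", "INSPEC": "Chemical"},
}


def slots_for(corpus, index_name):
    """Return the slot letters under which a corpus files a given index."""
    return sorted(
        slot
        for slot, by_corpus in INDEX_SLOTS.items()
        if by_corpus.get(corpus) == index_name
    )


# A "subject" search must consult different slots in different corpi.
print(slots_for("OPAC", "Subject"))
print(slots_for("EngIndex", "Subject"))
```

The point of the sketch is that a cross-corpus search on a semantically shared field such as subject cannot assume a fixed slot letter; it has to translate the field name per corpus first.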
For each of the 5,000 most frequently queried words, keylists are prepared
in advance: one per word for each of a number of collections of slots. Each
keylist contains a list of the documents containing the word, in DocID format.
Keylists are generally updated weekly.
Single-word queries first check for pre-made keylists, otherwise the
keylist is constructed at query time. Multiple-word queries are broken
into multiple single-word queries, whose results are then combined with
standard UNIX text manipulation commands (e.g. sort, merge, uniq).
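The keylist mechanism can be sketched as follows. This is an illustration under stated assumptions, not the Advance implementation: documents are modelled as a plain dictionary from DocID to text, all function names are hypothetical, and multi-word results are combined by intersection (Boolean AND), which the description above does not actually specify. Advance itself combines results with UNIX text tools rather than in-memory sets.

```python
def build_keylist(word, documents):
    """Scan {doc_id: text} and return the sorted DocIDs containing `word`."""
    return sorted(
        doc_id
        for doc_id, text in documents.items()
        if word in text.lower().split()
    )


def lookup(word, premade, documents):
    """Use a pre-made keylist when one exists, else build it at query time."""
    if word in premade:
        return premade[word]
    return build_keylist(word, documents)


def query(words, premade, documents):
    """Split a query into single-word lookups and intersect the results."""
    result = None
    for word in words.lower().split():
        hits = set(lookup(word, premade, documents))
        result = hits if result is None else result & hits
    return sorted(result or [])


documents = {
    10000001: "Neutron scattering in solids",
    10000002: "Neutron stars",
    10000003: "Scattering theory",
}
premade = {"neutron": [10000001, 10000002]}  # only frequent words are pre-built
print(query("neutron scattering", premade, documents))
```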
Topic/Verity
Topic is the name of a database product of the Verity corporation. They
are actually phasing this name out, so we will refer to it as "Verity"
only. Verity provides a hierarchically structured database which is highly
optimized for free text retrieval, and supports some relatively sophisticated
Boolean capabilities for classical information retrieval.
Verity is structured as follows.
- A server can address multiple collections.
- A collection consists of multiple documents. In principle, the same
  document can exist in multiple collections. In practice (i.e. in SciSearch)
  this is not the case, but if it were, the user would have to link the records
  together manually.
- A document consists of fields, structured fields, and zones.
  - Zones are optimized for indexing.
  - Structured fields are like composite records or C structures.
- A style file determines the common format of documents within collections.
- The entire document can also be addressed as a whole for free-text retrieval.
  This is actually the fastest lookup method available.
Interface with Verity is through an API. Currently, a LANL product, Explorer,
is the interface. The SciSearch interface passes HTML form information
to Explorer, which constructs Verity API calls, queries the Verity engine,
retrieves the results, and formats HTML for output.
This HTML/Explorer/Verity architecture is not unique to the LWW, but
rather is used lab-wide in a number of different Explorer flavors for a
number of different support environments. In the LWW, Verity is currently
specialized to the SciSearch database from ISI (see below).
Verity is highly optimized for text storage and retrieval, and retrieval
of non-indexed text fields is feasible from any collection. Support for
numeric data is weak; for example, numeric fields are not indexed. Similarly,
no relational information is available. Further, there is no linking
capability among Verity collections.
ScienceServer/MPS
This is a proprietary database originally produced by the Orion corporation,
and supplemented by front-end routines called the MPS Information Server,
and indexing routines provided by FSConsulting. It serves the EJournals
(see below). We will call this the MPS system.
Similar to Verity, MPS is a dedicated information retrieval product
on a proprietary database format. It supports a variety of boolean and
other search capabilities (see http://www.fsconsult.com/products/mps-server.html).
The LWW currently has a binary license for MPS, but will be receiving a
source license shortly.
Oracle
The above environments are currently available in the LWW. The ARP may
require access to an ODBC-compliant relational environment such as Oracle.
There are currently plans to bring up an instance of Oracle on a PC
from Sandia National Laboratory in order to provide a dedicated environment
on which SciSearch interaction with the VxInsight visualization product
can be demonstrated. This machine will generally not be available to the
ARP.
Alternatively, it has been suggested that an Oracle instance might be
made available on the Sun Ultra 3000 soon-to-be development machine (see
Server Configuration and Loading below), or on a new Linux machine.
Components of the LWW "Meta-Base"
The LWW currently consists of a number of distinct components, which we
will call corpi. A major task of Phase III of the LWW, and of the ARP,
is to try to integrate these corpi in various ways.
General Issues
For each of the components, we focus on a number of key issues:
- Content and Structure: What kind of information is kept, from what
  sources, and in what formats or "granularities"?
- Keywords and Indexing: How are searches facilitated by either distinct
  or "generatable" keywords?
- Controlled Vocabularies: Are any used?
- Engineering and Interface: What database holds the information,
  and how can it thereby be accessed?
- Sizing and Updates: How big are the corpi? How often and in what
  proportion are they updated?
- Ancillary Data: Is other ancillary information associated with the
  component available, even if not stored directly in the corpus (e.g. historical
  SciSearch queries)?
- Availability: Which LANL customers or associates can access the
  data?
OPAC
The main "card catalog" for the library, called the OPAC, is held in an
Advance database which is Z39.50 compliant for inter-library use. It has
a special Web interface emulating Explorer. It is currently intended for
the OPAC to remain as a legacy Advance corpus.
The OPAC contains records for all materials accessible through
the Research Library which have MARC records. This includes:
- Physical Holdings: Books, journals, proceedings, and grey literature.
- Electronic Holdings: Electronic journals, electronic databases.
These listings are at the "title" (i.e. journal title) rather than "article"
(i.e. article title) level. Thus each of the other LWW corpi below has
a single entry, as does the OPAC corpus itself.
As a database of MARC records, the OPAC benefits from keywords and from
the controlled vocabulary maintained by the Library of Congress (LC) system.
Thus most ISBN and ISSN entries have corresponding LC numbers.
Scientific Catalogs
The LWW maintains a variety of catalogs of bibliographic, abstract, and
indexing information provided by third parties. They are all currently
in the Advance format, although Biosis is in the process of being converted
to Verity. It is intended that all Advance databases will eventually be
converted to Verity.
Inspec
From the web page:
"INSPEC (Information Service in Physics, Electrotechnology
and Control) is the leading English-language citation database for the
world's literature on all aspects of physics, electronics, and computing.
INSPEC scans papers from approximately 4,200 journals, 1000 conferences,
and other publications. INSPEC corresponds to the print publications Physics
Abstracts, Electrical and Electronics Abstracts, and Computer and Control
Abstracts.
Subjects: Astronomy; Chemistry; Computer Science; Engineering; Earth
Sciences; Mathematics; Nuclear Information; Physics; Science (General/Popular)."
Inspec is provided by IEEE, and includes books, journals, conferences, and
some grey literature. Keywords are provided initially by authors and
editors, and matched against the IEEE hierarchical knowledge taxonomy to
provide a controlled vocabulary. Matches are identified, and non-matches
are included as "free terms". Access is restricted to LANL and UNM.
Biosis
From the web page:
"Citations, with abstracts, from Biological abstracts and Biological
abstracts/RRM (reports, reviews, meetings). BIOSIS indexes approximately
6500 journals and 2000 meetings/year, as well as books and other materials.
Subject coverage includes biological and biomedical sciences, botany, biochemistry,
biophysics, biotechnology, medicine, public health, radiation biology,
ecology and the environment.
Subjects: Biology/Genetics; Environment"
Biosis contains biological material only, as opposed to medical. It includes books,
journals, proceedings, and some grey literature. Keyword control is similar
to Inspec. Indexing terms are quite elaborate, including taxonomic classification,
sequencing data, etc. Access is limited to LANL and Stanford.
Engineering Index
From the web page:
"Engineering Index provides worldwide coverage of approximately
4,500 journals, and numerous conference proceedings, reports, books and dissertations.
Over half a million records are for papers in published proceedings. Subjects
covered are civil, energy, environmental, geological, and biological engineering;
electrical, electronics, and control engineering; chemical, mining, metals,
and fuel engineering; mechanical, automotive, nuclear, and aerospace engineering;
and computers, robotics, and industrial robots. Engineering Index at LANL
corresponds to the print publication Engineering Index, with additional
coverage of conference papers and proceedings.
Subjects: Engineering; Earth Sciences; Nuclear Information; Reports"
Provided by Elsevier. Includes books, journals, proceedings, and some grey
literature. Access is limited to LANL.
DOE Energy
From the web page:
"A multi-disciplinary database containing worldwide references
to basic and applied scientific and technical research literature. The
primary focus is on energy and related topics. The database includes references
to publications provided by the U.S. Department of Energy, its contractors,
and other government agencies; also information from the International
Energy Agency's Energy Technology Data Exchange (ETDE) and the International
Atomic Energy Agency's International Nuclear Information System (INIS).
Approximately half of the references are from sources outside the United
States. Abstracts are included for records from 1976 to the present. Approximately
half of the references are to journal literature and 25% to technical report
literature.
Subjects: Biology/Genetics; Chemistry; Computer Science; Defense; Environment;
Engineering; Mathematics; Nuclear Information; Patents; Physics; Reports"
Provided by DOE OSTI. Derived from the old Nuclear Science Abstracts (through
1972). Focuses primarily on DOE research reports, with some limited journal
coverage. The keyword vocabulary is controlled from a thesaurus and hierarchical
classification. Free terms are maintained. Access is limited to LANL.
SciSearch
This is the premier citation database available for scientific literature,
and is provided by the Institute for Scientific Information (ISI). SciSearch
holds abstract and indexing information about a broad and deep collection
of scientific journals, and most importantly, cross-referencing information
about citations among published papers.
The Social SciSearch database is coming online soon, and holds a similar
collection of social scientific literature. Here we focus on regular SciSearch
only.
Keywords come in two varieties:
- Author-supplied keywords are passed directly through.
- "Keywords plus" are generated by ISI using an unpublished algorithm which
  works basically as follows. First, a list of the referring papers which
  are present in the corpus is constructed, followed by the union of the title
  words from those papers. Those words occurring multiple times are then listed
  as keywords plus. It is unknown whether generated keywords plus which are also
  author keywords are retained in one or both lists.
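Since the ISI algorithm is unpublished, the following is only a sketch of the rough description above, not the actual procedure. It assumes that title words are pooled across all referring papers, compared case-insensitively, and kept when they occur at least twice; it applies no stop-word filtering, because none is described. The function name and the `min_count` parameter are hypothetical.

```python
from collections import Counter


def keywords_plus(referring_titles, min_count=2):
    """Return title words occurring at least `min_count` times in the pool."""
    counts = Counter()
    for title in referring_titles:
        counts.update(title.lower().split())
    return sorted(word for word, n in counts.items() if n >= min_count)


titles = [
    "Neutron scattering in liquids",
    "Inelastic neutron scattering",
    "Liquids under pressure",
]
print(keywords_plus(titles))
```

In practice common words such as "in" or "the" would quickly pass the threshold, so a real implementation presumably filters them in some way that is not known here.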
SciSearch is available to LANL staff and to a variety of partners (see
http://scisearch2.lanl.gov).
SciSearch Architecture and Loads
Weekly SciSearch loads proceed as follows:
- Nroff-formatted raw files are ftp'ed from ISI.
- A parsing process then produces formatted text files quite similar
  to the output from e.g. SciSearch alerts. Article boundaries within the
  text files are delimited with a tag.
- The text files store information on the number of times the article has
  been cited, and whether there is a link to an electronic version of the
  paper (e.g. through the EJournals facility, below). This information is
  updated within the text files monthly.
- With each load, an index of internal CitationIDs
  (see below) is also generated.
- At query time, these text files are formatted into HTML.
- The text files are then parsed and used to update two Verity Collections.
  - The main collection stores everything except citation information.
  - The cited collection stores author, title, volume, and journal,
    in addition to citations.
Note that:
- Information on which articles cite a given article is stored in the Verity
  Collections only, and not in the text files.
- Conversely, detailed reference information on the articles cited in a particular
  article is stored only in the text files, whereas in the Verity collection
  this information is represented only indirectly through a list of appropriate
  CitationIDs, which can then be queried iteratively.
- Communication from the Verity Collections back to the text files is facilitated
  by a Verity structured field holding, for each article, its text
  file name and byte offset.
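The file-name-plus-byte-offset link can be illustrated with a short sketch. The actual delimiter tag and the exact position the offset points to are not documented here, so this assumes a hypothetical `<REC>` tag and an offset that points at the tag opening the article; the function name is also illustrative.

```python
def read_article(path, offset, delimiter=b"<REC>"):
    """Return the text of the article starting at byte `offset` in `path`.

    Assumes (hypothetically) that articles are separated by `delimiter`
    and that `offset` points at the delimiter opening the article.
    """
    with open(path, "rb") as handle:
        handle.seek(offset)
        data = handle.read()
    if data.startswith(delimiter):
        data = data[len(delimiter):]
    end = data.find(delimiter)  # the next article starts here, if any
    if end != -1:
        data = data[:end]
    return data.decode("ascii", errors="replace").strip()
```

An ARP process holding a Verity result would call this with the two values from the structured field, avoiding a second Verity query to obtain the detailed reference information held only in the text files.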
Historical Queries
Queries sent to SciSearch are stored for historical purposes in human-readable,
ASCII log files containing the following information:
- IP address sending the query
- Time stamp
- Query text
- Number of results
Note that the range of years requested in the search is not available.
Standing alerts are not logged distinctly, but information about them
is available. In particular, every week standing alerts generate appropriate
standard queries, which are then recorded in the log files. Queries generated
from alerts can be distinguished from queries generated manually by the
fact that they include ranges of internal Verity record identifiers used
to limit queries to the weeks in question.
EJournals
The Electronic Journals (EJournals) portion of the LWW provides desktop
access to the content of scientific journals. Journals from a number of
publishers are stored locally, while the others are linked to externally.
Here we focus on locally stored journals, since access to them will be
less problematic for the ARP.
The LWW has entered into a close relationship with Elsevier (not actually
a publisher, but rather a distributor of multiple publishers such
as Pergamon), other publishing companies, and FSConsult, in order to provide
desktop access to a wide variety of electronic journals. These parties
all have a mutual interest in working together to develop these capabilities.
We will focus specifically on Elsevier here, which is the major effort,
with Academic Press following behind.
On a weekly basis, CDs are received from the publisher. A TOC file (in
a tagged flat-file format) contains abstract information per issue on the
CD, including authors, title, keywords, etc. Papers are available in PDF,
SGML, and raw text. LWW code runs an indexing routine (both the TOC and
the document text are indexed), creates SICIs, and updates the SICI
database (see below).
No citation information is available in indexed form, although in principle
it can be abstracted at least from the raw text copies of the papers, if
not the tagged SGML versions.
A subject specialist LWW librarian makes LANL subject (category) classifications
of all EJournals. This is how they are grouped on the web page.
EJournals are restricted to LANL and the Air Force Research Library,
although discussions are ongoing with others (e.g. Stanford).
xxx.lanl.gov
This database represents a very innovative project in electronic publishing
and communication among working scientists, with the potential to point
the way towards a revolution in the sociology of science. We anticipate
being able to have more information about this corpus, and how it might
interact with the LWW, in the near future.
Other Corpi
Unclassified Publications
From the web page:
"The Los Alamos Unclassified Publications Database is a multi-disciplinary
database containing references to basic and applied scientific and
technical literature authored by the staff of Los Alamos National
Laboratory, its contractors, and collaborators."
This corpus currently exists in a hybrid Explorer-Advance database where
a particular instance of Explorer is used to access an Advance database.
LANL Technical Reports
A corpus of classified reports exists in a secure vault in the library.
It is another instance of an Advance interface to a collection of PDF documents.
Patents
This corpus currently exists in the Barn Owl network tool. It is not currently
accessible to users, and has no GUI.
Citation Analysis
Yearly reports are generated from SciSearch concerning the following
issues:
- Ten most cited LANL articles.
- Top ten journals in which LANL articles appear.
- Top ten categories in which LANL articles appear.
- Number of citations vs. number of LANL articles published.
- Some other statistics.
These are manipulated in an Excel spreadsheet and converted to HTML for
user access.
Summary
Summary information is provided below for relevant corpi.
| Database | Format | Keywords | Size | Update/interval (avg) | Started | Searches/wk (avg) |
|---|---|---|---|---|---|---|
| OPAC | Advance | Library of Congress | 250 KRecs, 3 GB | Continuous | 1800's | 7000 |
| Inspec | Advance | IEEE: Controlled and Free | 6.2 MRecs, 64 GB | 6 KRecs/wk, 35 MB/wk | 1969 | 700 |
| Biosis | Advance | Biosis: Controlled and Free | 11.6 MRecs, 95 GB | 8 KRecs/1.3wk, 50 MB/1.3wk | 1969 | 1000 |
| EngInd | Advance | ** | 4.3 MRecs, 38 GB | 4 KRecs/wk, 15 MB/wk | 1969 | 400 |
| DOEng | Advance | OSTI: Controlled and Free | 3.8 MRecs, 38 GB | 6 KRecs/2wk, 20 MB/2wk | 1940's | 250 |
| SciSearch | Verity | Authors and Keywords+ | 16.5 MRecs, 50 GB | 20 KRecs/wk, 50 MB/wk | 1991 | 5200 (LANL), 9800 (other) |
| EJournals | MPS | Authors | ?, 300 GB | ?, 1 GB/wk | 1994 | 100 Journals, 1000 Articles |
| Unclassified Publications | Advance | Partial: DOE or LC | 54 KRecs, 500 MB | Irregular | 1943 | **Irma |
| LANL Patents | Barn Owl | | | | | |
The LWW Meta-Base.
Current and Anticipated Linkages and Crossover
Content Crossover
Here we consider the amount of content which is shared among the Advance
scientific corpi, the EJournals, and SciSearch. This is currently rather
difficult information to obtain. LWW staff have been able to provide a
few rough estimates based on their particular experience, and it will be
possible in the future to gather analytical information about, for example,
the number of journal titles which are held in common among the different
corpi, and possibly the number of article titles. For the present, however,
this section is a skeletal and incomplete representation of information
which will await future analysis.
Generally, however, we are informed by the following considerations.
- SciSearch is distinct from the others in that it is a very broad, but relatively
  shallow corpus, hitting the "top journals" in most fields of science.
- The Advance scientific corpi, on the other hand, are much more comprehensive
  within their own fields, including, for example, conference proceedings.
- Finally, the EJournals are also deep, but are effectively "samples" from
  the overall scientific literature, being grouped by publisher.
Specifically, we observe the following:
- Biosis-Inspec: Small overlap in medical physics, radiography, health
  physics, and simulation.
- Biosis-SciSearch: Moderate overlap of top biological literature.
When completed, the following table will show rough estimates (qualitative
or quantitative) by LWW staff of the amount of data which is held redundantly
in multiple corpi. Each cell gives the proportion of the corpus in that row
which is also held within the corpus of that column.
This relation is generally asymmetric.
| | Inspec | Biosis | EngInd | DOEng | SciSearch | EJournals |
|---|---|---|---|---|---|---|
| Inspec | | 20% | | | | |
| Biosis | 20% | | | | Moderate | |
| EngInd | | | | | | |
| DOEng | | | | | | |
| SciSearch | | | | | | 90% |
| EJournals | | | | | 60% | |
Estimated Overlaps among Corpi
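Once journal-title lists are available, the row-to-column overlap could be computed directly. The sketch below assumes each corpus is represented by a set of identifiers such as ISSNs; the sets shown are invented placeholders, not LWW data, and the function name is hypothetical.

```python
def overlap(row, column):
    """Fraction of the `row` corpus that is also held in the `column` corpus."""
    if not row:
        return 0.0
    return len(row & column) / len(row)


# Placeholder identifier sets, for illustration only.
corpus_a = {"issn-1", "issn-2", "issn-3", "issn-4"}
corpus_b = {"issn-3", "issn-4"}

print(overlap(corpus_a, corpus_b))  # share of A found in B
print(overlap(corpus_b, corpus_a))  # share of B found in A
```

Because the denominator is the size of the row corpus, swapping the arguments generally gives a different value, which is why the table is not symmetric.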
Database Crossover
Here we consider physical movements which either exist or are anticipated
among the LWW corpi.
Importing xxx.lanl.gov
There are currently plans to integrate xxx.lanl.gov with the LWW. This
project will begin in 1Q 1999, and be carried out by staff who will work
in conjunction with the ARP.
Movement to Topic
Current plans are to migrate the Advance databases in stages to Verity.
The first candidate is Biosis, which will likely be available in Verity in
either 1Q or 2Q 1999.
Document Reference Crossover
In order to facilitate crossover among corpi, it is necessary to have unique
keys which can identify documents existing in multiple corpi. We first
describe the references available, and then describe their relations.
Document References
Digital Object Identifier (DOI)
DOIs are being advanced by the publishing industry to support universal
references to electronically published documents. They act effectively
as URLs down to the directory level, with final access to the article level
unspecified. Although DOIs are available in Inspec, they are not currently
supported by the LWW.
Serial Item Contribution Identifier (SICI)
SICIs are currently the most universally available inter-corpus reference.
Multiple SICI formats for different publication types are defined in an
ANSI standard established as a cooperative venture involving the academic
and the digital library communities.
SICIs using the "article within serial publication" format are constructed
at load time in all of the Advance, EJournal, and SciSearch databases.
The syntax used in the Advance and EJournal databases is:
ISSN(yyyy[mm[dd]])vol[:issue[:part]]<page:title_code>control check_digit
where:
- fixed-width font is literal.
- [x] is an optional item.
- ISSN is the ISSN.
- yyyy is the year.
- mm is the month.
- dd is the day.
- vol is the volume number, or otherwise the highest-level available "enumeration"
  information.
- issue is the second-level enumeration.
- part is the third-level enumeration.
- page is the first page number.
- title_code is a string consisting of the first letter of up to the first
  six words of the title.
- control is a constant.
- check_digit is a check digit.
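The field definitions above can be assembled as in the following sketch. It builds only the portion of the SICI up to the closing `>`; the control constant and the check digit computation are not specified in this report and are deliberately left out. Upper-casing the title code is an assumption, the function names are hypothetical, and the ISSN in the example is a made-up value.

```python
def title_code(title, max_words=6):
    """First letter of each of up to the first six title words (upper-cased)."""
    return "".join(word[0] for word in title.split()[:max_words]).upper()


def sici_body(issn, year, volume, page, title,
              month=None, day=None, issue=None, part=None):
    """Build ISSN(yyyy[mm[dd]])vol[:issue[:part]]<page:title_code>."""
    date = f"{year:04d}"
    if month is not None:
        date += f"{month:02d}"
        if day is not None:  # a day is only meaningful with a month
            date += f"{day:02d}"
    enum = str(volume)
    if issue is not None:
        enum += f":{issue}"
        if part is not None:  # a part is only meaningful with an issue
            enum += f":{part}"
    return f"{issn}({date}){enum}<{page}:{title_code(title)}>"


print(sici_body("1234-5679", 1995, 21, 12,
                "Words and their interpretation in information behavior studies",
                month=2, issue=3))
```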
It is regrettably not possible to match full SICIs from queries into the
database, due to inconsistencies in the publisher-supplied databases. In
order to provide unambiguous matches, it is actually necessary to construct
one full and five partial SICIs for each document, as follows:
- Title only
- Date only
- Enumeration only
- Neither date nor enum
- Neither date nor title
SICIs are not currently indexed in Advance.
Due to the limitations of indexing information provided by ISI, SciSearch
maintains its own format of the SICI, with syntax:
ISSN()vol[:issue[:part]]<page:>
where the fields are as defined above. This is due to be updated, but currently only
75% of SciSearch SICIs match the corresponding non-SciSearch SICIs.
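One way to compare the two formats is to reduce a full SICI to the SciSearch form by dropping the date and the title code. The sketch below does this with a regular expression; it is an illustration only, the function name is hypothetical, and the sample SICI (including its trailing control segment) is a made-up value.

```python
import re

# ISSN, optional date in parentheses, enumeration, then <page:title_code>.
FULL_SICI = re.compile(
    r"^(?P<issn>[^(]+)\((?P<date>\d*)\)(?P<enum>[^<]+)"
    r"<(?P<page>[^:>]*):(?P<title>[^>]*)>"
)


def to_scisearch_sici(sici):
    """Reduce a full SICI to the form ISSN()vol[:issue[:part]]<page:>."""
    match = FULL_SICI.match(sici)
    if match is None:
        raise ValueError(f"not a recognisable SICI: {sici!r}")
    return f"{match['issn']}(){match['enum']}<{match['page']}:>"


print(to_scisearch_sici("1234-5679(199502)21:3<12:WATIII>2.0.TX;2-J"))
```

A match on the reduced form is weaker than a full match, since two articles starting on the same page of the same issue would collide, which is consistent with the partial-match entries in the reference table later in this section.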
DocIDs
Documents in the Advance corpi are identified uniquely by an eight-digit
DocID, beginning at 10,000,000, arbitrarily assigned. DocIDs can be repeated
among corpi, meaning that it is the combination <Corpus,DocID> which uniquely
identifies a particular document within a corpus. Furthermore, the same document
contained in different corpi is likely to have different DocIDs in each
corpus.
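For ARP code that handles several Advance corpi at once, this suggests keying documents on the pair rather than on the DocID alone. A minimal sketch, with a hypothetical `DocRef` type and invented example values:

```python
from typing import NamedTuple


class DocRef(NamedTuple):
    """Composite key: a DocID is only unique within its own corpus."""
    corpus: str
    doc_id: int


# The same DocID in two corpi refers to two unrelated documents.
records = {
    DocRef("Inspec", 10000001): "an Inspec record",
    DocRef("Biosis", 10000001): "an unrelated Biosis record",
}
print(len(records))
```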
Citation ID
SciSearch has its own internal identifiers, with the syntax:
lname initials # yyyy # journal_char vol # page
where:
- lname is the last name of the first author.
- initials are the initials of the first author.
- yyyy is the publication year.
- journal_char is the first character of the journal title.
- vol is the volume.
- page is the starting page number.
CitationIDs are used by SciSearch to match citations against other entries
available in SciSearch.
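A CitationID can be assembled from the fields above as in this sketch. The exact spacing around the `#` separators and any case normalisation are not documented here, so the layout below simply follows the syntax line literally and should be treated as an assumption; the function name and the example values are hypothetical.

```python
def citation_id(last_name, initials, year, journal_title, volume, page):
    """Build 'lname initials # yyyy # journal_char vol # page'."""
    journal_char = journal_title.strip()[0]  # first character of the title
    return f"{last_name} {initials} # {year} # {journal_char} {volume} # {page}"


print(citation_id("Smith", "JA", 1997, "Physical Review Letters", 78, 1234))
```

Because only the first character of the journal title is kept, distinct journals can in principle produce the same CitationID, so matches obtained this way are best treated as candidates rather than certainties.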
URLs
URLs are unique identifiers within the EJournals.
When the EJournals are loaded, a "SICI database" is updated, holding
<SICI,URL> pairs, with six entries for each URL.
URL information for journals which are available remotely to the LWW
is also available. A table holds publisher-specific URL templates which
can be instantiated in predictable ways using partial SICI information
(e.g. ISSN and enumeration).
Both of these tables are made available to Advance, and then sent to
SciSearch.
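The two tables can be pictured as a dictionary of <SICI,URL> pairs and a dictionary of per-publisher templates. The sketch below is illustrative only: the function names, the publisher name, the template syntax, and the URLs are all hypothetical, and the SICI variants are passed in rather than derived.

```python
def register(sici_db, sici_variants, url):
    """Store one <SICI,URL> pair per SICI variant of a locally held article."""
    for sici in sici_variants:
        sici_db[sici] = url


def remote_url(templates, publisher, issn, volume, issue, page):
    """Instantiate a publisher-specific URL template from partial SICI data."""
    return templates[publisher].format(
        issn=issn, volume=volume, issue=issue, page=page
    )


sici_db = {}
variants = [f"variant-{n}" for n in range(6)]  # one full and five partial SICIs
register(sici_db, variants, "http://ejournals.example.org/article/42.pdf")

templates = {
    "ExamplePress": "http://journals.example.org/{issn}/{volume}/{issue}/{page}.pdf",
}
print(remote_url(templates, "ExamplePress", "1234-5679", 21, 3, 12))
```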
Document Reference Crossover
There is no single, consistent, and accurate reference available across
all of the LWW corpi. The following table shows available paths to the
different corpi, given a starting point in the left-most column.
| Reference | Advance | SciSearch | EJournals |
|---|---|---|---|
| Non-SciSearch SICI | Lookup OK. | Partial match through SciSearch SICI. | Lookup OK through SICI database. |
| SciSearch SICI | Partial match through non-SciSearch SICI. | Lookup OK. | Partial match through SICI database. |
| DocID | Lookup OK. | Partial match through non-SciSearch SICI. | Lookup OK through non-SciSearch SICI, then SICI database. |
| CitationID | Partial match through SciSearch SICI. | Lookup OK. | Partial match through SciSearch SICI, then SICI database. |
| URL | Lookup OK through SICI database. | Partial match through SICI database. | Lookup OK. |
Reference paths through LWW corpi
Note that complete paths would be available if SciSearch and non-SciSearch
SICIs could be made consistent.
Engineering Issues
ARP Read/Write Access
ARP programs will require at least read access to multiple LWW corpi. Whether
the ARP will require write access to LWW corpi will depend on whether LWW
front-end applications (e.g. the SciSearch pages) are modified to support
ARP information, or whether ARP applications (e.g. the TalkMine testbed)
will provide a self-contained user environment. The first option reduces
ARP development effort at the expense of LWW effort, and the second does the reverse.
Read Access
All LWW database formats will allow read access with varying degrees of
difficulty.
- Advance: Read access in Advance is possible, but would likely be
  quite difficult, involving either interfacing with the WebZ server or some
  other portion of Advance, or with the untested and unproven SQL interface
  available through Universe.
- SciSearch: Access to SciSearch information is available either through
  the Verity API calls to the Verity collections, or directly to the text
  files. Verity querying will be necessary to retrieve full citation information.
  It is an open question whether massive extraction from the Collections will
  affect performance inordinately.
- MPS: Information about EJournals necessary for the ARP is best retrieved
  through the text TOC files associated with each loaded CD.
Write Access
One consideration for ARP development is the ability to work in the native
LWW database format. It is difficult to make a final determination at this
point if any of the database environments currently in use in the LWW will
be appropriate for use by the ARP, but indications are negative. In particular:
- Advance: Again, interfacing directly to the PIC environment will
  be difficult, and the SQL capability is unknown.
- Verity: While we could establish distinct ARP Verity collections,
  little if any interaction among collections in a server is supported.
  Further, there is poor to no support for numeric data and no relational
  capability.
- ScienceServer: Little is known of this architecture, although its
  user interface is quite similar to Verity.
It may be advisable for the ARP to maintain its own data server for write
access, with read access to the LWW corpi.
Server Configuration and Loading
All major LWW systems are supported by UNIX servers. Most are Solaris 2.5
or 2.6, except for sciserver.lanl.gov (the ScienceServer) which
is running Linux. Most servers are in the Green network. The current host
of library.lanl.gov (the Advance corpi) will shortly be moving
into the Blue network and becoming a development environment. It is a Sun Ultra
3000 Enterprise machine with six 250 MHz processors running Solaris 2.6.
Further details are available at http://lib-www.lanl.gov/~mlbm/libnet/index.html,
currently in draft form.
Acknowledgements
I would like to extend my thanks to the LWW staff for their great cooperation
in preparing this report, in particular Miriam Blake, Doug Chafe, Frances
Knudson, Abe Lederman, and Mark Martinez.