General Information on Computer Searching
Chemical Information Sources Wiki
Introduction
The first step that most people take in a search for information is to use a free Internet search engine, such as Google. Another option is to search commercial databases such as SciFinder Scholar, ingenta, Academic Search Premier, and many others.
This chapter attempts to clarify some aspects of Boolean searching that may not be that obvious to you, such as proximity operators and nesting. In addition, we introduce the concept of truncation (masking) to expand the scope of a search.
It is important to recognize the different years of coverage of various databases and to be aware that most of the really authoritative chemistry database searching is still done in commercial databases through database vendors.
Database producers and database vendors make it possible to search files that are located outside our geographic area through the techniques of online database searching. The online database industry is now in its fifth decade, and many sophisticated search techniques have been developed during that period. In comparison, the search techniques found in Internet search engines might be considered rudimentary, but they are constantly improving.
Each vendor offers a range of databases, some of which are specific to a discipline (chemistry, physics, etc.). Others deal with mission-oriented problems such as energy or the environment, cutting across disciplines in their subject coverage. Once connected to a database vendor's system, it is possible to perform CROSS-DATABASE SEARCHES simultaneously in a number of related files (multi-file searching). There are also tools that libraries can purchase to perform "federated searching," the term currently used to describe searching several databases at once. WebFeat is one such product.
Computer-Readable Sources
There are databases corresponding to the different primary and secondary printed sources:
- BIBLIOGRAPHIC - provide a bibliography of documents, perhaps with abstracts, and increasingly with links to the full texts of the primary documents.
- NON-BIBLIOGRAPHIC - numeric, full-text, dictionary, and directory databases: provide actual answers to questions without having to consult another source.
In a sense, the Internet search engines have turned the entire Web into one giant database. However, it has been shown that no single Web search engine indexes everything on the Web. It is the usual case that only 1/3 or less of the publicly available pages are caught by any given search engine's robot that roams the Web looking for pages to index. And you should know that no robot makes the voyage around the Web to collect Web pages every day. It might take months for a robot to complete its journey through the desired Web pages. Thus, no search engine's results are ever totally up to date. That is true whether you try Google, Hotbot, Northern Light, Altavista, or any other search engine. Furthermore, the free Web search engines do not have access to library databases such as the Web OPACs that tell you the holdings of the libraries, nor can they access any of the commercial vendors' offerings.
Nevertheless, the search engines are very powerful tools, and for certain types of questions, they can be very useful in a search for information. For example, many people, including chemists, maintain their own personal Web pages nowadays. For locating someone and perhaps finding a full or selective bibliography or a curriculum vitae (CV) of a chemist, the Web may offer the best route to reliable, up-to-date information. Likewise, very new or hot topics may be discussed in Web news groups, discussion lists, or blogs long before they appear in traditional journals and, later, in abstracting and indexing services.
For all of these reasons, we are beginning to see the commercial vendors add options to transfer the search strategy used in a commercial database search to the Internet for further information. One example is Elsevier Science Direct's Scirus, which searches both Elsevier journals and the Web. Another is STN's eScience. eScience requires you to first search a commercial STN database before taking your search to the Web.
In spite of the ease of accessing the Web, it ought to be a fairly rare case that you begin a subject search for information with a Web search engine if you have easy access to online commercial databases in your organization. Databases such as the Web of Science (including Science Citation Index, potentially all the way back to 1900), Elsevier MDL's databases Gmelin and Beilstein (which cover the literature of modern inorganic, organic, and organometallic chemistry back to their beginnings in the 18th and 19th centuries), and Chemical Abstracts (which covers all areas of chemistry in a comprehensive manner back to 1907) are usually much better first choices, if they are available to you.
Options for Database Searching
The options for database searching include:
- ONLINE SEARCHING of commercial databases outside the geographic boundaries of the organization where the search is performed.
VENDORS of online search services (for example, STN International) lease or acquire databases from the database PRODUCERS (such as Chemical Abstracts Service or the Institute for Scientific Information) and make them available on remote computers. For a given vendor, which may have dozens or hundreds of databases on its computers, the databases are all searched by a common command language or graphical user interface. In the vast majority of these cases, there is a fee for searching the databases.
- WEB SEARCH ENGINES.
As noted above, the powerful search engines of today can provide a useful supplement to traditional online searches. A useful guide to search engines is maintained on the Search Engine Watch Web Site.
- FREE DATABASES ON THE WEB.
Some databases that can be searched free of charge on the Internet are of very high quality, for example, those produced by the National Library of Medicine or other government agencies or commercial organizations. However, the quality of most databases that are freely accessible on the Internet is not likely to be as high as that of commercial databases. In addition, the search interfaces that the user encounters differ considerably among free Internet databases. Nevertheless, they should not be ignored for certain types of searches.
- IN-HOUSE SEARCHING of databases within the organization.
Chemical and pharmaceutical companies now routinely load databases on their own computers.
Costs and Benefits of Online Searching
The costs of a commercial online search are usually not fixed, but are dependent on several factors, including telecommunications network charges (even a connection via the Internet is not free on a commercial system), connect time on the vendor's computer, royalties charged for the information extracted from the database (known as HIT CHARGES), and on some systems, charges for the search terms input in the search strategy.
The benefits of using an online vendor to search databases include:
- Command language is uniform across all databases on that vendor's system. (Unfortunately, there is little movement toward adoption of a Common Command Language among vendors.)
- More years of the database are searchable than with other formats of the database, such as CD-ROM, and those years can be searched simultaneously.
- Trained Help Desk personnel will assist you when problems arise. STN's Help Desk: 1-800-753-4227 or help@cas.org.
STN International is at present the only online vendor to have available the abstract data from Chemical Abstracts. The abstract's summary of the document provides a quick way to assess whether the document itself should be read for further information. See the examples of journal article and patent abstracts (labeled "AB") in the STN CA File Quick Reference Card. The card also shows examples of the Messenger search commands that must be used on the STN system when searching the CA database, with over 30,000,000 bibliographic records, in native command mode.
STN is the only vendor on which you can perform structure searches of the CAS Registry File. A subset of the CA database is the CA Student Edition on the OCLC FirstSearch system.
Boolean Search Operators
Online search systems offer BOOLEAN SEARCH OPERATORS that show the logical relationship among different concepts. See "Operators for Relating Search Terms" for some examples of Boolean search operators on the STN system.
The most common Boolean operators are:
- OR - Concepts linked with the OR operator are synonymous or related in some fashion.
The aim is to broaden the scope of the search when the OR operator is used by including acronyms, abbreviations, and similar terms that may be used in the indexing of the documents in the database. One document in the answer set may contain only one of the terms, a different document may have another one, and a third may contain two, three, or all of the terms in an OR statement. The OR Boolean operator puts all of these into the final answer set.
The normal use of the English word "or" implies a choice, with only one thing possible in the final selection. In a Boolean sense, OR really grabs all of the items and puts them into a set. A special variant of the OR operator is XOR. XOR retrieves a document only if exactly one of the terms in the statement is present; it skips any document that contains both terms.
Example: pie OR cake
If each of the pieces of pie and cake in a bakery were placed on its own plate and arranged on an enormous tray, we would satisfy the search (pie OR cake), and the tray would represent our answer set. Since the XOR operator was not used, there could even be some plates on which both pie and cake were found.
- AND - Different concepts are combined with the AND operator to ensure that both are found in the same document(s).
In conversational English, "and" is used to group things that may or may not be similar. In a Boolean search, all terms connected with the AND operator must appear in each document in the answer set.
Example: cake AND ice cream
In this example, think of each of the pieces of cake as having to be on its own plate with some ice cream on top in order to satisfy the search.
- NOT - A concept is excluded from the final answer set with the use of the NOT operator.
Example: (cake AND ice cream) NOT chocolate
Example: (pie OR cake) NOT chocolate
Let's assume that you are allergic to chocolate. What would happen in the NOT examples if chocolate cake were the only type of cake available? In the first case, you would not get any dessert because the NOT completely eliminates the subset when one of the terms satisfies it. It throws out each of the plates containing the chocolate cake even if the ice cream is your favorite, vanilla. In the second NOT case, however, our search would allow us to have a piece of pie (as long as it wasn't chocolate pie or the plate didn't also have some chocolate cake on it!).
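The bakery examples above map directly onto set operations. The following is a minimal Python sketch, not the way any vendor's system works internally; the plates and their contents are made up purely for illustration.

```python
# Hypothetical bakery "database": each numbered plate is a record
# listing everything that is on it.
plates = {
    1: {"pie"},
    2: {"cake"},
    3: {"pie", "cake"},
    4: {"cake", "ice cream"},
    5: {"cake", "ice cream", "chocolate"},
    6: {"pie", "chocolate"},
}

def having(term):
    """Return the set of plate numbers whose contents include the term."""
    return {number for number, contents in plates.items() if term in contents}

pie, cake = having("pie"), having("cake")
ice_cream, chocolate = having("ice cream"), having("chocolate")

pie_or_cake = pie | cake                    # OR  -> union
pie_xor_cake = pie ^ cake                   # XOR -> one term or the other, not both
cake_and_ice_cream = cake & ice_cream       # AND -> intersection
first_not = (cake & ice_cream) - chocolate  # (cake AND ice cream) NOT chocolate
second_not = (pie | cake) - chocolate       # (pie OR cake) NOT chocolate

print(sorted(pie_or_cake))         # [1, 2, 3, 4, 5, 6]
print(sorted(pie_xor_cake))        # [1, 2, 4, 5, 6]
print(sorted(cake_and_ice_cream))  # [4, 5]
print(sorted(first_not))           # [4]
print(sorted(second_not))          # [1, 2, 3, 4]
```

Note how NOT removes plate 5 entirely, ice cream and all, because one item on that plate matches the excluded term.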
The NOT operator must be used with caution in online searching since it could eliminate some documents that are of interest if they also happen to discuss aspects of a topic that are not of interest. There are more specific variants of the AND operator that can be used to define the spatial relationships of search terms. These are called POSITIONAL or PROXIMITY OPERATORS. On STN, they are:
- (A) - terms must be adjacent without regard to order*
- (W) - terms must be adjacent and in the order specified*
- (L) - terms must occur in the same logical unit (field)
- (S) - terms must be in the same sentence within the same field.
Note that on STN the (A) and (W) operators mean the same in all files; other proximity operators may yield different results depending on the file.
STN assumes that multi-word phrases are to be searched using the (W) operator in the absence of explicit positional or other Boolean operators.
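To make the difference between the proximity operators concrete, here is a rough Python sketch that imitates (W), (A), and (S) on a short piece of text. It is only an illustration under simplifying assumptions: the function names are hypothetical, words are split on non-alphanumeric characters, sentences end at ".", "!", or "?", and (W) is taken to require adjacency, as the implied-phrase rule above suggests. STN's actual implementation is not described in this chapter.

```python
import re

def tokens(text):
    """Split text into lowercase words (a simplifying assumption)."""
    return re.findall(r"[a-z0-9]+", text.lower())

def adjacent_in_order(text, first, second):
    """Imitates (W): 'first' immediately followed by 'second'."""
    words = tokens(text)
    return any(a == first and b == second for a, b in zip(words, words[1:]))

def adjacent_any_order(text, first, second):
    """Imitates (A): the two terms are adjacent in either order."""
    return (adjacent_in_order(text, first, second)
            or adjacent_in_order(text, second, first))

def same_sentence(text, first, second):
    """Imitates (S): both terms occur within one sentence."""
    sentences = re.split(r"[.!?]+", text)
    return any(first in tokens(s) and second in tokens(s) for s in sentences)

# Made-up sample text
sample = "Electron spectroscopy of surfaces. Chemical analysis followed."

print(adjacent_in_order(sample, "electron", "spectroscopy"))   # True
print(adjacent_in_order(sample, "spectroscopy", "electron"))   # False
print(adjacent_any_order(sample, "spectroscopy", "electron"))  # True
print(same_sentence(sample, "electron", "chemical"))           # False
```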
Truncation (Masking) of Characters to Expand a Search
In many cases where subject searches are concerned, we are looking for topics that involve words built on a common root word, or that have some other variations that are easily signaled to a computer by means of a special symbol. TRUNCATION is the technique that tells the computer to form an answer set consisting of all records that contain words with the characters input for the search, but could also contain related words with suffixes (or, in some cases, prefixes) or variable characters at a given point in the word. It is NOT possible to use the truncation technique on SciFinder Scholar research topic searches. However, it can be applied on command-driven searches such as those done on STN. For examples, see:
Truncation can occur at the left end or the right end of a word stem or within the word. STN now allows all three types of truncation in the CA File Basic Index, an index of subject words from the title words, words in the abstracts, or index terms (including Registry Numbers for compounds discussed in the documents). The limit of terms that can be gathered in a set by truncation is 30,000 stems. For left truncation the search term must have at least four characters.
Novice searchers and even professionals sometimes make gross errors with truncation, especially in systems that allow both left- and right-hand truncation. Think what would happen if a search were run with these character strings truncated on both sides:
Every occurrence of the word "chemical" or "chemistry" or "biochemical," etc. would be pulled in the first search, but also documents containing words such as "hemisphere". In the second case, every document that contains an English word that ends in -ION would be pulled. Probably not what the searcher would have wanted!
On the STN system, truncation symbols are:
| Symbol | Function | Example |
|---|---|---|
| exclamation point (!) | Exactly one character | cataly!e |
| hash mark (#) | One or no character | alcohol# |
| question mark (?) | Any number of characters | ?therap? |
As noted in the table, the # sign can be used at the end of a word to pick up both singular and plural forms of a word. Another way of accomplishing the same thing on STN using the command language option is to enter SET PLURALS ON at the system prompt. Both left- and right-hand truncations are allowed with the "?".
There are limits to the number of terms that can be gathered into a set using truncation. Therefore, caution must be exercised in using truncation to prevent too many search terms (or unexpected words) from entering the answer set.
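One way to get a feel for the three STN symbols is to translate them into regular expressions and try them on a few words. The sketch below does that in Python. It is an approximation for practice only: the helper names are hypothetical, "character" is taken to mean a letter, digit, or underscore, and the real STN system may treat hyphens, digits, and other characters differently.

```python
import re

# Assumed translation of the STN symbols in the table above
SYMBOLS = {
    "!": r"\w",    # exactly one character
    "#": r"\w?",   # one or no character
    "?": r"\w*",   # any number of characters, including none
}

def truncation_to_regex(term):
    """Build a case-insensitive regular expression from a truncated term."""
    pattern = "".join(SYMBOLS.get(ch, re.escape(ch)) for ch in term)
    return re.compile(pattern, re.IGNORECASE)

def matches(term, word):
    """True if the whole word satisfies the truncated search term."""
    return truncation_to_regex(term).fullmatch(word) is not None

print(matches("cataly!e", "catalyze"))      # True
print(matches("cataly!e", "catalyse"))      # True
print(matches("cataly!e", "catalyst"))      # False
print(matches("alcohol#", "alcohols"))      # True
print(matches("alcohol#", "alcoholic"))     # False
print(matches("?therap?", "chemotherapy"))  # True
```

The same helper shows why short stems are risky. A left-truncated term such as `?ion` (our own illustration) matches "ion" but also "reaction," "solution," and every other word with that ending.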
Unfortunately, there is no uniformity of symbols used to designate truncation among different vendors or search engines, although often we find an asterisk (*) used to indicate the right-hand truncation point. That is the case with the Web of Science, for example.
With SciFinder Scholar, no truncation is used. The searcher simply types into the Research Topic search window the natural language expression that defines the search, without even trying to insert Boolean search terms. The SciFinder Scholar search algorithm has some built-in intelligence to look for relevant word forms for the search. For instance, the search system automatically searches for both singular and plural subject words.
Let's see an example of a search on SciFinder Scholar for the analytical technique "Electron Spectroscopy for Chemical Analysis (ESCA)," including results from both the CAplus and Medline databases.
At the time it was run, the search as entered found 4395 references where the two concepts "electron spectroscopy" and "chemical analysis" were closely associated with each other and only 582 where the phrase as entered was found. In this case, let's repeat the search using the acronym for the analytical technique (ESCA) and also use a synonymous acronym, XPS. (The technique is also known as X-Ray Photoelectron Spectroscopy.) We have the option of entering synonymous words in parentheses, following a term or phrase. Thus, entering the research topic search on SciFinder Scholar as:
XPS (ESCA)
would imply to the system that you are looking for synonymous terms (an OR search). This search found considerably more documents: 114,511 at the time of the search on October 3, 2004. However, many of the 35,609 records pulled by the ESCA part of the search were false drops that match the word "escape"! Entering ESCA by itself pulls 7516 records with the term "as entered," and it appears that all but the oldest (a 1918 record) are relevant. Thus, the technique of entering synonyms in parentheses must be used with caution on SciFinder Scholar.
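We do not know exactly how the SciFinder Scholar algorithm expands a term, but the "escape" false drops are what one would expect whenever a short acronym is treated as a word stem rather than as a whole word. The Python sketch below illustrates the difference. The titles are invented and the function names are hypothetical.

```python
import re

# Invented titles, for illustration only
titles = [
    "ESCA study of polymer surfaces",
    "Escape of electrons from thin films",
    "XPS and ESCA characterization of catalysts",
]

def stem_hits(stem, texts):
    """Match any word that merely begins with the stem."""
    pattern = re.compile(r"\b" + re.escape(stem), re.IGNORECASE)
    return [t for t in texts if pattern.search(t)]

def whole_word_hits(word, texts):
    """Match the term only when it stands alone as a complete word."""
    pattern = re.compile(r"\b" + re.escape(word) + r"\b", re.IGNORECASE)
    return [t for t in texts if pattern.search(t)]

print(len(stem_hits("esca", titles)))        # 3, including the "Escape" title
print(len(whole_word_hits("esca", titles)))  # 2
```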
The CA, Registry, and Other CAS-Produced Files on STN: CAS Databases
Chemical Abstracts is the largest and most nearly comprehensive abstracting service for information in chemistry. It covers a very broad range of topics and has been published since 1907. At present Chemical Abstracts Service creates three main files and several related databases. These include the CA File of literature that extends back to 1907 and the CAOLD file that at present covers the period 1907-66. The Registry File contains searchable information that leads to the rapid identification of a compound, when a name, molecular structure, or other pertinent data is known about it. The Registry File also links these substances to the information that is indexed in the CA File and other chemical databases on the STN system through the Registry Numbers assigned by Chemical Abstracts Service to chemical substances. The CAS REGISTRY NUMBER is a unique number assigned to each chemical substance in the Registry File. For isatin, it is 91-56-5. To accommodate the continuing growth of substance information in the Registry file, CAS will begin to assign 10-digit CAS Registry Number (CAS RN) identifiers for newly registered substances in mid-January 2008.
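The final digit of a CAS Registry Number is a check digit, which makes it possible to catch many typing errors before a search is run. Working from right to left through the digits that precede the check digit, multiply the first by 1, the second by 2, and so on. The check digit is the last digit of the sum. For isatin (91-56-5): 6×1 + 5×2 + 1×3 + 9×4 = 55, so the check digit is 5. A minimal Python sketch follows; the function names are our own.

```python
def cas_check_digit(body_digits: str) -> int:
    """Compute the check digit from the digits that precede it."""
    total = sum(
        position * int(digit)
        for position, digit in enumerate(reversed(body_digits), start=1)
    )
    return total % 10

def is_valid_cas_rn(cas_rn: str) -> bool:
    """Check the layout (digits-2 digits-1 digit) and the check digit."""
    parts = cas_rn.split("-")
    if len(parts) != 3 or not all(part.isdigit() for part in parts):
        return False
    first, second, check = parts
    if len(second) != 2 or len(check) != 1:
        return False
    return cas_check_digit(first + second) == int(check)

print(is_valid_cas_rn("91-56-5"))  # True  (isatin)
print(is_valid_cas_rn("91-56-4"))  # False (wrong check digit)
```

A valid check digit only shows that the number is well formed. It does not prove that the number has actually been assigned to a substance in the Registry File.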
Also produced by CAS are the CASREACT file of organic reaction data, the CHEMCATS file that links chemical substances with commercial suppliers, the CHEMLIST file of regulatory data, a special variant of the CA File, CAplus, that offers rapid coverage of the articles in the main journals of chemistry, and the MARPAT file that facilitates retrieval of structures covered in patents through a technique called Markush searching.
The CA File covers chemical literature found in journals, patents, patent families, technical reports, books, conference proceedings, and dissertations from all areas of chemistry, biochemistry, chemical engineering, and related sciences from 1907 to the present. The CAplus file is a special version of the CA File that even has records for about 600 articles published before 1907. Since October 1994 it has contained all articles from more than 1,500 key chemical journals, including records for document types not covered in Chemical Abstracts (CA): biographical items, book reviews, editorials, errata, letters to the editor, news announcements, product reviews, meeting abstracts, and miscellaneous items. Bibliographic information and abstracts for the articles from the key chemical journals are added within one week of journal receipt. Both the CA and CAplus files were retrospectively converted to include earlier information. By the end of 2002, all CA bibliographic data that appeared in the printed Chemical Abstracts was included in the CA and CAplus files.
There are low-cost learning files that correspond to:
- CA File, the bibliographic file that now has over 30,000,000 records dating from 1907 to the present. It includes full indexing and abstracting of the original documents. Examples are found in the LCA database summary sheet.
- Registry File, the file containing information on over 70,000,000 substances, including the CAS Registry Number, the CAS Index Name, other Chemical Names, Molecular Formula, and structural depiction for each substance. (Examples of the Learning Registry File searches and records are found in the LREG database summary sheet.)
SciFinder and Other Front-end Software and WWW Access
Learning the command language of STN International, DIALOG, or other vendors can be a significant barrier to online searching for some. There are programs that can help the novice searcher. One such FRONT-END program is STN Express with Discover. Questel's IMAGINATION software is another front-end software package.
The most recent efforts by the major vendors to win online searchers have been directed toward the Internet. For example, STN EASY allows direct access to the STN databases with a relatively straightforward graphical user interface. Most recently, STN has developed for professional searchers STN on the Web. The U.S. National Library of Medicine's PubMed gives free and easy access to a version of the National Library of Medicine's main database, Medline.
Other CAS products are SciFinder and its academic counterpart, SciFinder Scholar, which make the searching of some of the CAS databases (CAplus, Registry, CHEMLIST, CHEMCATS, and CASREACT) relatively effortless. They let the user perform chemical searches by clicking on the icons depicted below.
(Reproduced with permission of CAS, a division of the American Chemical Society.)
With the 2007 revision of SciFinder and SciFinder Scholar, the capability to combine answers was introduced. Prior to this, answer sets could be reduced only by using the "analyze" or "refine" options built into the tool. With the 2007 revision, the option to "combine" (OR), "intersect" (AND), or "remove" (NOT) answer sets extends the control that the user has over the search process. The revision also permits:
- Export of commercial chemical records from CHEMCATS into Excel
- Printing of structures in thumbnail display format
- Options related to journal titles that interface with the bibliographic software packages Reference Manager, EndNote, and ProCite.
Formats: Document Types
In the printed "Chemical Abstracts," a B or P immediately before an abstract number designates a book or a patent, respectively. In the online CA File, these and other document types are recorded in the Document Type (DT) field:
| Code | Document Type |
|---|---|
| B | Book |
| C | Conference proceedings |
| D | Dissertation |
| GR | General review |
| J | Journal article |
| P | Patent |
| R | Review |
| T | Technical report |
(Reproduced with permission of CAS, a division of the American Chemical Society.)
Thus, combining an answer set number with one or more codes or words can limit the answer set to a particular document type or eliminate an unwanted type, e.g.,
=> S L4 NOT P/DT
or
=> S L4 AND J/DT
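The effect of these two commands can be pictured as a simple filter on the Document Type field. The Python sketch below mimics them on a made-up answer set. The record numbers are invented, the function names are hypothetical, and only the DT codes come from the table above.

```python
# Made-up answer set standing in for L4; "dt" holds the Document Type code
answer_set_l4 = [
    {"id": 1, "dt": "J"},
    {"id": 2, "dt": "P"},
    {"id": 3, "dt": "J"},
    {"id": 4, "dt": "B"},
]

def exclude_type(records, code):
    """Mimics S L4 NOT <code>/DT: drop records of the given type."""
    return [record for record in records if record["dt"] != code]

def limit_to_type(records, code):
    """Mimics S L4 AND <code>/DT: keep only records of the given type."""
    return [record for record in records if record["dt"] == code]

without_patents = exclude_type(answer_set_l4, "P")
journal_only = limit_to_type(answer_set_l4, "J")

print([record["id"] for record in without_patents])  # [1, 3, 4]
print([record["id"] for record in journal_only])     # [1, 3]
```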
Eight new document types (biography, book review, editorial, errata, letter, miscellaneous, news announcement, and product review) were introduced to the CAplus file in 1994. SciFinder Scholar allows you to refine the answer set by many parameters, among them the document types shown below.
(Reproduced with permission of CAS, a division of the American Chemical Society.)
On the Web of Science, General Searches can be limited to:
- Article
- Abstract of published item
- Bibliography
- Biographical item
- Correction
- Correction, Addition
- Letter
- Meeting Abstract
- News item
- Review
and several other choices.
Other Ways to Refine a Search on SciFinder Scholar
SciFinder Scholar searches can be refined by many other options, as seen below.
(Reproduced with permission of CAS, a division of the American Chemical Society.)
Similar refinements are possible with Web of Science and other database searches.
This wiki page was originally created by Gary Wiggins. If you have a legitimate desire to contribute to its contents, please request an account from the sysop, Dr. David J. Wild, by e-mailing him at djwild @ indiana.edu

