PG Data from The Online Books Page
John Mark Ockerbloom maintains The Online Books Page and has kindly made his enhanced PG data (gzip format, 308K) available under the Creative Commons Attribution-NonCommercial license. The following are my notes on the file; any errors and omissions are mine. The format is simple: each line holds a field name followed by one or more spaces and then the data, ending with a newline (linefeed). Most fields can be repeated; details are below. Records are separated from each other by an extra newline.
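The line format described above is simple enough to read with a few lines of Python. The following is a minimal sketch, not a reference parser: the function names, the `latin-1` encoding, and the sample values are my assumptions for illustration and do not come from the data file itself.

```python
import gzip
import re
from collections import defaultdict


def parse_records(text):
    """Split file text into records; each record maps a field name to a list of values."""
    records = []
    # Records are separated by an extra newline, i.e. a blank line.
    for block in re.split(r"\n\s*\n", text.strip()):
        if not block.strip():
            continue
        record = defaultdict(list)
        for line in block.splitlines():
            if not line.strip():
                continue
            # Field name, then one or more spaces, then the data.
            name, _, value = line.partition(" ")
            # Fields may repeat, so every value is kept in a list.
            record[name].append(value.strip())
        records.append(dict(record))
    return records


def load_records(path):
    """Read a gzip-compressed data file (the encoding is an assumption)."""
    with gzip.open(path, "rt", encoding="latin-1") as handle:
        return parse_records(handle.read())


# Invented sample in the described layout; these are not real entries.
sample = (
    "GREF NEW\n"
    "AUTHOR Doe, Jane, 1800-1870\n"
    "AUTHOR Roe, Richard\n"
    "TITLE An Example Title\n"
    "NUMBER 99999\n"
    "\n"
    "FMT example\n"
    "NUMBER 99998\n"
)

records = parse_records(sample)
```

Keeping every field as a list, even when it occurs once, avoids special cases for the repeatable fields.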
File Format
The data file contains roughly 8231 PG records with complete information (plus a few additional entries with decimal IDs), and another 1000 or so records that contain only the FMT field and the ID (NUMBER).
| Label | Description | Count* |
| Main |
| NUMBER | Project Gutenberg ID (plus a few non-PG with decimal ID) | 9260 |
| TITLE | includes embedded subtitle, volume, issue, language, etc. | 8223 |
| Person |
| AUTHOR | format: Last, First, optional prefix/infix/suffix, optional birth/death dates -- may include HTML entities for ISO chars | 7708 |
| CONTRIBUTOR | may cover many authors in a collection (e.g. #1980) but NOT co-authors who work together | 110 |
| EDITOR | | 530 |
| ILLUSTRATOR | | 63 |
| TRANSLATOR | | 891 |
| Other |
| EREF | External reference; format: TEXT_for_link URL | 2 |
| FMT | file download information | 2625 |
| GREF | Gutenberg folder reference; "NEW" is the current numeric system; the older form is something like "etext98/ozvrs10" | 9003 |
| LCCN | Library of Congress Control Number | 2 |
| NOTE | only one occurrence; the info is not shown at http://onlinebooks.library.upenn.edu/webbin/gutbook/lookup?num=5159 | 1 |
| PREF | components; format: GutenbergID TEXT-for-link (typically the title or a descriptive subset that's meaningful in this context) | 256 |
| SERIAL | format: IssueNumber TITLE; facilitates linking multiple issues of magazines and such | 172 |
| SREF | See also reference; format: GutenbergID TEXT-for-link | 170 |
| # | comment character | 1 |
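Several of the values in the table have a small internal structure of their own. As a rough sketch, the helpers below split the "GutenbergID TEXT-for-link" form (PREF, SREF; SERIAL's "IssueNumber TITLE" has the same shape), split the EREF form "TEXT_for_link URL", and decode HTML entities in names. The helper names and sample values are hypothetical, and I am assuming the EREF URL is the last space-delimited token.

```python
import html


def split_id_text(value):
    """Split 'GutenbergID TEXT-for-link' (PREF, SREF) or 'IssueNumber TITLE' (SERIAL)."""
    first, _, rest = value.partition(" ")
    return first, rest.strip()


def split_eref(value):
    """Split 'TEXT_for_link URL'; assumes the URL is the last space-delimited token."""
    text, _, url = value.rpartition(" ")
    return text.strip(), url


def decode_entities(value):
    """Turn HTML entities such as &euml; into the characters they stand for."""
    return html.unescape(value)


# Invented values for illustration only.
part = split_id_text("1234 Some Part Title")
link = split_eref("Example_text http://example.org/x")
name = decode_entities("Bront&euml;, Charlotte")
```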
The fields are approximately in the following order, though with some variation among records: GREF, EREF, SERIAL, SREF, AUTHOR, ILLUSTRATOR, TRANSLATOR, EDITOR, CONTRIBUTOR, TITLE, PREF, LCCN, NOTE, FMT, NUMBER. (See my suggestion on this below.)
The following fields have multiple values in the indicated number of records:
| Label | Count* |
| AUTHOR | 147 |
| CONTRIBUTOR | 25 |
| EDITOR | 39 |
| FMT | 4 |
| GREF | 1 |
| PREF | 256 |
| SREF | 4 |
| TRANSLATOR | 90 |
* Note that the counts may not exactly match the latest file, though I still find them helpful in understanding the data.
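Counts like those in the two tables can be recomputed against a newer copy of the file. The sketch below assumes each record has already been read into a dictionary that maps a field name to a list of values (my representation, not part of the file format), and it counts records rather than individual lines; whether the first table was tallied the same way is an assumption on my part.

```python
from collections import Counter


def field_counts(records):
    """Return (records containing each field, records where that field repeats)."""
    present = Counter()
    repeated = Counter()
    for record in records:
        for name, values in record.items():
            present[name] += 1
            if len(values) > 1:
                repeated[name] += 1
    return present, repeated


# Invented records for illustration only.
example_records = [
    {"AUTHOR": ["Doe, Jane", "Roe, Richard"], "TITLE": ["An Example"], "NUMBER": ["99999"]},
    {"AUTHOR": ["Doe, Jane"], "NUMBER": ["99998"]},
    {"FMT": ["example"], "NUMBER": ["99997"]},
]

present, repeated = field_counts(example_records)
```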
My Suggestions
- Make sure the data fields are in the same order for every record. That will make it easier to DIFF against data exported from another source.
- Add a distinct SUBTITLE field, e.g. as done in PG's GUTINDEX and in Classicosm's PG metadata.
- Add a NOTE field.
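Until the field order is consistent in the file itself, the first suggestion can be approximated on the reading side by writing each record back out in one fixed order before running DIFF. This is only a sketch: it assumes records are dictionaries mapping a field name to a list of values, it uses the approximate order noted earlier as the canonical one, and the names `FIELD_ORDER` and `serialize` are hypothetical.

```python
# Approximate field order observed in the file, used here as the canonical order.
FIELD_ORDER = [
    "GREF", "EREF", "SERIAL", "SREF", "AUTHOR", "ILLUSTRATOR", "TRANSLATOR",
    "EDITOR", "CONTRIBUTOR", "TITLE", "PREF", "LCCN", "NOTE", "FMT", "NUMBER",
]


def serialize(record, order=FIELD_ORDER):
    """Write one record as text with its fields in a fixed order."""
    rank = {name: index for index, name in enumerate(order)}
    # Unknown fields sort after the known ones, alphabetically among themselves.
    names = sorted(record, key=lambda name: (rank.get(name, len(order)), name))
    lines = [f"{name} {value}" for name in names for value in record[name]]
    return "\n".join(lines) + "\n"


# Invented record for illustration only.
text = serialize({"NUMBER": ["1"], "TITLE": ["T"], "AUTHOR": ["A", "B"]})
```

Repeated values keep their original relative order, so only the position of each field changes.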
Updated Oct. 28, 2004
Classicosm is a Product Architect site.
classicosm -at- product architect -dot- com (Feedback welcome!)
Copyright 2004 by Scott S. Lawton. All Rights Reserved. "Classicosm" and "A world of timeless value" are service marks owned by Scott S. Lawton.