Feedback on PGTEI

Project Gutenberg ("PG") is

The following feedback is intended to be constructive. I think PG should move towards an XML "master" document as soon as possible. I'm not convinced that TEI is the most appropriate XML vocabulary; I hope the following feedback helps either to improve it or to replace it. I would guess that much of Marcello's work can be modified for any XML vocabulary.

My feedback includes many minor issues and a few major issues. If there's any confusion over which is which, just ask!

PGTEI Vocabulary

Section 18: I strongly recommend omitting the requirement that TEX and NROFF characters must be escaped. (As far as I can tell, that's not part of TEI.) It may well be a useful optional feature; perhaps it could be turned on by including a specific processing instruction.

Consistent with the separation of structure and format, the

PGTEI Examples

Having separate

In fact, the tag itself seems redundant. Shouldn't the

It would be very useful to label every

alice.tei: alice.tei and lmiss.tei are both missing

PGTEI Documentation

Kudos for actually writing documentation, and for numbering the sections to match the TEI Lite docs! It would be nice to add hyperlinks that point to the TEI Lite docs (preferably the

Section 6.3: Information on

Section 7: The paragraph beginning "In the PDF format it does not insert a footnote marker in the text" wasn't clear to me, and the example didn't help. What is it trying to say? Is the example correct as shown?

Generated HTML, PDF and Text

In the PG license, section numbers such as "1.A." should appear on the same line as the text that follows -- per the original and to avoid wasting space.

Generated HTML

There appear to be two validation errors, e.g. in the

In the documentation, why is "Versprich mir, Heinrich" repeated in the output, the second time in white?

In the "Faust" example, I would prefer much less whitespace after the speaker labels (though I did not check the original text, much less look for an original scan).

With the caveat that I'm not a CSS expert, I believe that the following suggestions follow the spirit of using HTML for structural markup and CSS for format:

Default CSS

The lack of space between paragraphs goes against Web conventions. (It's fine as an option but a poor choice for the default.)

The filename ("de-gnutenberg-press-1.0-persistent.css") is too long for Macintosh OS 9 and below (38 chars vs. the max of 31). Possibly relevant: did Windows jump from 8.3 filenames directly to "long" filenames, or was there an intermediate limit?

Generated PDF

It would be useful to have at least a few user-selectable parameters, e.g. page size.

For the default page size, a width of 5.5" rather than 5.83" would accommodate US Letter as well as A4 paper. (Incidentally, using quote marks for inches is a heuristic that should be checked when converting text to XML.)

The default font size seems overly large for printing, though I didn't compare it to a representative sample of books.

The spacing between sections of the PG footer seems excessive.

Generated Text

It would be useful to generate a form that can be easily compared to the original PG text, even if that's an option rather than the default. For example: don't rewrap lines!

The format didn't appear to match PG standards (e.g. "underlining" the chapter titles using various characters) -- though I know very little about PG standards so I may well be wrong.

PGText to PGTEI

Again, kudos for including this. Semi-automated conversion is an important part of migrating legacy PG texts to XML.

Are the heuristics from things like GutenMark included? That would seem to be quite valuable.

Please do not rewrap lines! That makes it very difficult to compare different versions, editions, formats, markup techniques, etc.

Try to identify the table of contents. If deleting it is too drastic, just add a comment such as "If this is the table of contents, delete it. It will be generated automatically by divGen type=toc." Why? Because I suspect (hope?) that many people who review the generated TEI will be beginners.
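
To illustrate why rewrapping hurts comparison, here is a minimal sketch using Python's standard difflib and textwrap modules. The sample lines, the one-character correction, and the helper name changed_lines are all made up for illustration; nothing here reflects how the PGTEI tools actually work.

```python
import difflib
import textwrap

# Two sample lines standing in for an original PG text (illustrative only).
original = [
    "Alice was beginning to get very tired of sitting by her sister",
    "on the bank, and of having nothing to do.",
]

# A one-character correction that keeps the original line breaks.
edited_same_breaks = [
    "Alice was beginning to get very tired of sitting by her sister",
    "on the bank, and of having nothing to do:",
]

# The same correction, but rewrapped to a different (hypothetical) width.
edited_rewrapped = textwrap.wrap(" ".join(edited_same_breaks), width=40)


def changed_lines(a, b):
    """Count the lines that ndiff reports as removed from a or added in b."""
    return sum(
        1 for line in difflib.ndiff(a, b) if line.startswith(("- ", "+ "))
    )


print(changed_lines(original, edited_same_breaks))
print(changed_lines(original, edited_rewrapped))
```

With the line breaks preserved, only the corrected line shows up in the comparison. After rewrapping, every line differs, so the real change is buried in noise.
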

If an Introduction is properly part of front matter, try to identify it and place the

In addition to processing the source text, include information from the database, e.g. LoC Class, Subject, Alternate Title.
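
As for the inch-mark heuristic mentioned under Generated PDF, a converter could at least flag the candidates for a human to review. The following is only a sketch: the pattern, the function name flag_inch_candidates, and the sample lines are my own assumptions, not anything taken from the PGText to PGTEI converter.

```python
import re

# Assumption: a number (optionally with a decimal part) directly followed by
# a straight double quote is a *candidate* inch mark, not a certain one.
INCH_CANDIDATE = re.compile(r'\d+(?:\.\d+)?"')


def flag_inch_candidates(lines):
    """Return (line_number, matched_text) pairs for a human to review."""
    hits = []
    for number, line in enumerate(lines, start=1):
        for match in INCH_CANDIDATE.finditer(line):
            hits.append((number, match.group(0)))
    return hits


# Illustrative sample lines; the third is a deliberate false positive.
sample = [
    'a width of 5.5" rather than 5.83" would fit',
    '"Come in," she said.',
    'He called it "Route 66" on the map.',
]

for hit in flag_inch_candidates(sample):
    print(hit)
```

The third sample line shows why this should stay a flag rather than an automatic rewrite: a closing quotation mark after a number looks exactly like an inch mark.
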

If langUsage is intended to describe the language in the current document, perhaps it's worth adding a comment such as "delete any languages not present in this etext".

Rather than

Spaces before closing p and head tags are preserved. Is that intentional?

When converting The Wonderful Wizard of Oz, several dozen instances of the following markup were incorrectly added (in every case, the text didn't appear significantly different than surrounding paragraphs):
This line was incorrectly marked as a
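
On the earlier point about spaces before closing p and head tags: if they turn out not to be intentional, a small post-processing pass could trim them. This is a rough sketch under that assumption; the function name and sample markup are hypothetical, and a regular expression is only a stand-in for a proper XML-aware pass.

```python
import re

# Spaces or tabs that sit directly before a closing </p> or </head> tag.
TRAILING_SPACE = re.compile(r"[ \t]+(</(?:p|head)>)")


def trim_before_closing_tags(tei_text):
    """Remove spaces and tabs immediately before closing p and head tags."""
    return TRAILING_SPACE.sub(r"\1", tei_text)


# Illustrative sample, not actual converter output.
sample = "<head>Chapter I </head>\n<p>Dorothy lived in Kansas.  </p>"
print(trim_before_closing_tags(sample))
```

Line breaks are left alone on purpose, so the result can still be compared line by line with the input.
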
Posted Oct. 28, 2004
Classicosm is a Product Architect site.