File Formats

From CrossWire Bible Society

Revision as of 13:29, 31 December 2012

Bible study programs use a plethora of markup formats. Even more have been suggested for use in creating Bibles and other religious material.

CrossWire Bible Society respects copyright. As such, conversion of material that is under copyright without permission from the copyright holders is not supported by The SWORD Project.

This page lists some of the more common file formats relevant to The SWORD Project, associated utilities, and other CrossWire projects.

SWORD Input formats

The SWORD Project supports the following markup: OSIS, ThML, GBF and plain text.

OSIS

Open Scripture Information Standard

The Open Scripture Information Standard (OSIS) is "a common format for many visions." It is an XML format for marking up scripture and related text, part of an initiative composed of translators, publishers, scholars, software manufacturers, and technical experts, coordinated by the Bible Technologies Group. It is co-sponsored by the American Bible Society and the Society of Biblical Literature.

The most recent XML schema is OSIS 2.1.1, and a manual is also available. There are some examples of OSIS files at Bibles in OSIS 2.0.

This markup format is recommended by the CrossWire Bible Society and can be used for creating all types of resources for The SWORD Project. Support for OSIS is actively maintained and support for any unsupported elements or features needed for a module you may be working on may be requested.

Prince XML is a proprietary software program that converts XML and HTML documents into PDF files by applying Cascading Style Sheets (CSS). It is developed by YesLogic, a small company based in Melbourne, Australia. It can be used to create high quality PDF Bibles from OSIS files[1]. A paper by Jim Albright of Wycliffe Bible Translators was presented at BibleTech 2010 on using the open-source GUI companion for Prince XML, called Princess.

ThML

Theological Markup Language

This format is a variant of XML based on TEI and HTML, developed by and for the Christian Classics Ethereal Library. The specifications for this markup format are available at http://www.ccel.org/ThML/.

This markup format is used in some SWORD resources, but only the creation of free-form "General book" modules based on existing CCEL resources is currently supported. Other works and new works should be created using the OSIS or TEI format.

GBF

General Bible Format

This markup format is intended as an aid to preparing Bible texts (specifically the WEB and WEB:ME) for use with various Bible search programs. The complete specification is at http://www.ebible.org/bible/gbf.htm.

This markup format was previously used for some SWORD modules but is now deprecated in favor of OSIS. The rudimentary gbf2osis.pl Perl utility may be used to convert GBF to OSIS for import to SWORD's native format. Adyeth hosts a gbf2osis Python utility that he wrote to convert the GBF texts from ebible.org to OSIS. See [2].

VPL

Verse-Per-Line

This plain-text format is used by SWORD for import of Bibles. It consists of one verse per line, with an optional verse reference at the beginning. The vpl2mod utility may be used for import. The mod2vpl utility may be used for export to VPL; it has a command line switch to prepend the verse reference to each line. VPL is deprecated in favor of the IMP format, which is more widely useful.
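Since IMP is preferred over VPL, a VPL file that carries verse references can be rewritten as IMP entries. The following is a minimal sketch, not a CrossWire tool. It assumes each line starts with a reference of the shape "Book chapter:verse" followed by whitespace and the verse text; the function name vpl_to_imp and the reference pattern are hypothetical.

```python
import re
from typing import Iterable, Iterator

# Assumed reference shape: book name, a space, then chapter:verse,
# e.g. "Gen 1:1" or "1 John 2:3". Real VPL files may differ.
REF_PATTERN = re.compile(r"^(?P<ref>.+?\s\d+:\d+)\s+(?P<text>.*)$")


def vpl_to_imp(lines: Iterable[str]) -> Iterator[str]:
    """Yield IMP lines (a "$$$" key line, then the verse text) for each VPL line."""
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line.strip():
            continue  # skip blank lines
        match = REF_PATTERN.match(line)
        if match is None:
            raise ValueError(f"no verse reference found in line: {line!r}")
        yield "$$$" + match.group("ref")
        yield match.group("text")


if __name__ == "__main__":
    vpl = ["Gen 1:1 First verse text", "1 John 2:3 Another verse text"]
    print("\n".join(vpl_to_imp(vpl)))
```

Lines without a recognizable reference raise an error rather than being guessed at, because a VPL file without references relies on verse order alone.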

IMP

Import Format

This proprietary file format is used by SWORD for import of all types of modules. The three utilities imp2vs (for Bibles and verse-indexed commentaries), imp2ld (for lexicons, dictionaries, and daily-devotionals), and imp2gbs (for all other types of books) can be used to import IMP files to SWORD's native formats.

An IMP file consists of any number of entries. Each entry consists of a key line followed by any number of content lines. The key line begins with "$$$", followed by the entry's key. For example, "$$$Gen 1:1" would be the key line for the Genesis 1:1 entry of a Bible or commentary module.

The content lines of an entry may consist of any text (provided that the first three characters of the line are not "$$$"). The internal markup of the content may be in any format supported by SWORD, namely OSIS for any module type or ThML for freeform books from CCEL.
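The entry structure described above can be sketched as a small reader. This is an illustration only, not the logic of imp2vs, imp2ld or imp2gbs. The function name parse_imp is hypothetical, and ignoring any lines that appear before the first key line is an assumption.

```python
from typing import Iterable, Iterator, List, Optional, Tuple

KEY_MARKER = "$$$"


def parse_imp(lines: Iterable[str]) -> Iterator[Tuple[str, List[str]]]:
    """Yield (key, content_lines) for each entry in IMP-formatted text."""
    key: Optional[str] = None
    content: List[str] = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line.startswith(KEY_MARKER):
            # A new key line closes the previous entry, if there was one.
            if key is not None:
                yield key, content
            key = line[len(KEY_MARKER):].strip()
            content = []
        elif key is not None:
            content.append(line)
        # Assumption: lines before the first key line are ignored.
    if key is not None:
        yield key, content


if __name__ == "__main__":
    sample = "$$$Gen 1:1\nFirst verse text\n$$$Gen 1:2\nSecond verse text\nmore text\n"
    for entry_key, entry_lines in parse_imp(sample.splitlines()):
        print(entry_key, entry_lines)
```

The content lines are returned untouched, so any OSIS or ThML markup inside them is left for later processing.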

See also DevTools:Modules#IMP_Format.

imp2osis

CrossWire has a Perl script called imp2osis.pl, which will convert IMP to OSIS fairly well (except a few 'corner cases'). Whenever CrossWire receives an IMP submission, this is the first thing that is run, allowing CrossWire to do validation and other OSIS sanity checks. Some editing is usually necessary after converting an IMP file to an OSIS XML file. For example, the attribute canonical is omitted from the osisText and all <div> elements, and the language attribute xml:lang defaults to "en".
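The kind of post-conversion editing mentioned above could be scripted along the following lines. This is a hedged sketch using Python's standard xml.etree.ElementTree, not part of imp2osis.pl. The function name patch_osis is hypothetical, and the default canonical value of "true" is an assumption: the right value depends on the content of each element and should be reviewed by hand.

```python
import xml.etree.ElementTree as ET

OSIS_NS = "http://www.bibletechnologies.net/2003/OSIS/namespace"
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def patch_osis(xml_text: str, lang: str, canonical: str = "true") -> str:
    """Set xml:lang on osisText and add a missing canonical attribute.

    The canonical value is an assumption supplied by the caller.
    """
    ET.register_namespace("", OSIS_NS)  # keep OSIS as the default namespace on output
    root = ET.fromstring(xml_text)
    for osis_text in root.iter(f"{{{OSIS_NS}}}osisText"):
        osis_text.set(XML_LANG, lang)
        if "canonical" not in osis_text.attrib:
            osis_text.set("canonical", canonical)
    for div in root.iter(f"{{{OSIS_NS}}}div"):
        if "canonical" not in div.attrib:
            div.set("canonical", canonical)
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    sample = (
        f'<osis xmlns="{OSIS_NS}">'
        '<osisText osisIDWork="Example" xml:lang="en">'
        '<div type="book" osisID="Gen"/>'
        "</osisText></osis>"
    )
    print(patch_osis(sample, "fr"))
```

Existing canonical attributes are left alone, so the function only fills in what the conversion omitted.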

The SWORD Project Utilities

Precompiled versions of many of these programs are available in most Linux distributions, using the distribution's package installer. For Windows, they can be found here.[1]

Module Creation Tools

It is recommended that Unicode text files used for module creation be encoded as UTF-8.[2]

Diagnostic Tools

Conversion Tools

OSIS Utilities

Miscellaneous

Notes on SWORD Tools

  1. If you have Xiphos installed in Windows, the Sword utilities are available in the Xiphos\bin folder.
  2. EOLs should be either Unix style (LF) or Windows style (CRLF). Text files with Mac style EOLs (CR) may give rise to errors or other unexpected behaviour.
  3. The IMP file may contain a residue of XML markup.
  4. The VPL file may contain a residue of XML markup.
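Notes 2 and the UTF-8 recommendation can be checked before running any import tool. The following is a minimal sketch, not a SWORD utility; the function name normalize_text is hypothetical. It assumes the input is meant to be UTF-8 and rewrites every line ending as Unix style (LF), which also removes Mac style (CR) endings.

```python
def normalize_text(data: bytes) -> bytes:
    """Validate UTF-8 and convert CRLF and bare CR line endings to LF."""
    # Raises UnicodeDecodeError if the input is not valid UTF-8.
    text = data.decode("utf-8")
    # Replace CRLF first so that its CR is not converted a second time.
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    return text.encode("utf-8")


if __name__ == "__main__":
    mixed = b"line one\rline two\r\nline three\n"
    print(normalize_text(mixed))
```

Converting everything to LF is a design choice for simplicity; Windows style (CRLF) endings are also acceptable according to note 2.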

Recommended Non-SWORD Utilities

Formats for Which CrossWire Maintains Converters

The SWORD Project uses primary source e-texts. These texts come in numerous formats. CrossWire maintains converters for a number of formats, described below. The converters may target other markup formats, e.g. TEI or OSIS, or may simply export binary data to text, as is the case with our STEP exporter. Specific discussion of each of the available converters is found elsewhere on this page.

STEP

Standard Template for Electronic Publishing

This file format was formerly used by QuickVerse and WORDsearch, and is currently used for some e-Sword books.

While not an open standard, the publicly released documentation and specifications for this format can be found partially mirrored at http://www.crosswire.org/bsisg/. Some utilities for working with this format are listed below. It is unlikely that the SWORD Project will support this format in the future as it is largely dead.

Not to be confused with STEP (Scripture Tools for Every Pastor), or with the new front-end application (Tyndale STEP) being developed by Tyndale House, Cambridge in collaboration with CrossWire.

Unbound Bible Format

BIOLA's Unbound Bible offers many of its resources for download in a proprietary but relatively simple tab-delimited plain-text format (TDT). There are usually two variants, one with versification mapping to the ASV, and the other without verse mapping. All available downloads may be found on Unbound Bible's download page.

There is no widespread use of this format, but the rudimentary unb2osis.pl utility may be used to convert Unbound Bible format to OSIS for import to SWORD's native format.

USFM

Unified Standard Format Markers

This plain-text format is a common internal-use format within Bible translation agencies and Bible societies. It is the native format of ParaTExt. Paratext is used by more than 60% of all Bible translators world-wide. The current release is Paratext 7.3. Our own Perl script usfm2osis.pl may be used to convert USFM to OSIS for import to SWORD's native format. See Converting SFM Bibles to OSIS. USFM uses a separate file for each Bible book. USFM is also supported by the open-source program called Bibledit. There are examples of Bibles in USFM format available for download at [3]. These include the KJV, ASV, WEB, HNV and PNG Bibles.

USFM is one of the formats that can be used by Go Bible Creator.

USX

USX is an XML schema that will be the underlying data structure in the next release of UBS Paratext, which is in beta now. SIL's Language Software Development team is working along with UBS on this. This version of Paratext can take in USFM projects and export USX files.

USX was defined to support the Every Tribe Every Nation (ETEN) Digital Bible Library (DBL) alliance. The alliance brings together the United Bible Societies, SIL/WBT, American Bible Society and other Bible agencies. Under the ETEN framework, Bible translations that have been made publication-ready in the DBL, for access by approved End User Ministry Partners (EUMP), will be stored in USX format.

The USX schema is available in the compact Relax NG Schema language.

Zefania XML

Zefania is an XML format for Bible markup with only the most simple structural tags for book/chapter/verse, notes, etc. The project is now hosted on SourceForge. The Zefania Bible Reader may be used to display Zefania XML Bibles through XSL transformation in browsers. See also the related Bible Resources Archive.

The CrossWire utility zef2osis.pl may be used to convert Zefania XML to OSIS for import to SWORD's native format.

Go Bible

Following an agreement made in July 2008 with the program's author Jolon Faichney, Go Bible was adopted by CrossWire as its Java ME software project.

To achieve fast navigation and general ease of use on even the simplest of Java mobile phones, Go Bible data is fully indexed, as well as being compressed (as are all JAR files). The format is described in Go Bible data format. Go Bible data is structured as Book | Chapter | Verse text and does not support notes, headings, cross-references, etc. The developer kit Go Bible Creator can take USFM, ThML or OSIS as the source text format, but source files usually have to be specially adapted to suit it. For example, OSIS files produced by Snowfall Software's SFMToOSIS script are not structured in the way Go Bible Creator expects. Work has begun on an XSLT script to convert such OSIS XML files to the format suitable for Go Bible. Go Bible Creator version 2.3.2 and onwards can take a folder of USFM files as the source text format.

Go Bible source code is now available here on the CrossWire Repository. To access this you will need to have an account.

GoBibleDataFormat is being extended in the SymScroll branch.

Other Utilities

These are not part of The SWORD Project, but may be useful. A link is given for each.

Go Bible utilities

GBF Tools

STEP Utilities

ThML Utilities

Zefania Utilities

Optical Character Recognition

Text development activities may be greatly assisted by using OCR software. This section will list OCR programs that CrossWire volunteers have found useful. Proprietary programs should not be listed here, because the preference at CrossWire is to use free and/or open-source software.

Tesseract

See also
