tokenized-to-tree

Perform text analysis and patch the results back into the source XML

Repository
Git URL https://github.com/transpect/tokenized-to-tree.git
SVN URL https://github.com/transpect/tokenized-to-tree
Base URI http://transpect.io/tokenized-to-tree/

ttt:line-finder

Import

<p:import href="http://transpect.io/tokenized-to-tree/line-finder-module/xpl/line-finder.xpl"/>

Dependencies

Synopsis

<ttt:line-finder xmlns:ttt="http://transpect.io/tokenized-to-tree">
  <p:input port="lines" primary="true"/>
  <p:input port="ttt-paras"/>
  <p:input port="stylesheet"/>
  <p:output port="result" primary="true"/>
  <p:option name="ignore-matched-lines" required="false" select="'false'"/>
  <p:option name="debug" required="false" select="'no'"/>
  <p:option name="debug-dir-uri" required="false" select="'debug'"/>
</ttt:line-finder>

ppp:postprocess-poppler

Import

<p:import href="http://transpect.io/tokenized-to-tree/postprocess-poppler-module/xpl/postprocess-poppler.xpl"/>

Dependencies

Synopsis

<ppp:postprocess-poppler xmlns:ppp="http://transpect.io/postprocess-poppler">
  <p:input port="source" primary="true"/>
  <p:input port="stylesheet"/>
  <p:input port="param-doc"/>
  <p:output port="result" primary="true"/>
  <p:option name="debug" required="false" select="'no'"/>
  <p:option name="debug-dir-uri" required="false" select="'debug'"/>
</ppp:postprocess-poppler>

ttt:prepare-input

Ignorable, Normalized and Generated Content

Content that should be ignored for string counting purposes will receive a role="ttt:placeholder" attribute.

If the content is not an element, it will be wrapped in one of the following special elements:

ttt:comment
Contains a comment. It is invisible to the string-length counting process and is re-inserted once the token markup has been added.
ttt:ignorable-text
Contains text. It is invisible to the string-length counting process and is re-inserted once the token markup has been added.
ttt:pi
Contains a processing instruction. It is invisible to the string-length counting process and is re-inserted once the token markup has been added.

These placeholder elements will then be emptied for the document that is output on the result port. The full elements will be retained in the document that is output on the with-ids port.

After the result document has been processed (tokenization, analysis, creating token markup and attaching the analysis results to the marked-up tokens), the empty placeholder elements will be replaced with their original content.

For this to work, all placeholder elements need to have IDs. A generated xml:id attribute starting with 'NOID_' will be added if the future placeholder element does not have an ID yet. The 'NOID_' IDs will be removed during the merging step, ttt:merge-results (file ttt-5-expand-placeholders.xpl).
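
As a rough illustration, the fragments below sketch how a comment might appear in the two outputs. The para element, the ID value, and the placement of the role attribute on the wrapper are assumptions made for this example. Only the element name ttt:comment, the role value, and the 'NOID_' prefix come from the description above.

```xml
<!-- Hypothetical input paragraph containing a comment -->
<para>Ars longa,<!-- check quotation --> vita brevis.</para>

<!-- Assumed shape on the with-ids port: the comment is wrapped and keeps its content -->
<para xmlns:ttt="http://transpect.io/tokenized-to-tree">Ars longa,<ttt:comment
  role="ttt:placeholder" xml:id="NOID_c1"><!-- check quotation --></ttt:comment> vita brevis.</para>

<!-- Assumed shape on the result port: the same placeholder, emptied -->
<para xmlns:ttt="http://transpect.io/tokenized-to-tree">Ars longa,<ttt:comment
  role="ttt:placeholder" xml:id="NOID_c1"/> vita brevis.</para>
```

In this sketch the shared xml:id is what allows the emptied placeholder to be paired with its full counterpart during merging.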

There is an additional type of ttt:placeholder element:

ttt:normalized-space
An element that contains exactly one space character, plus an @original attribute. For counting and analysis purposes, it will look like an ordinary single space character. The original space-like content will be restored from the @original attribute during the merging phase.
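
A minimal sketch of this placeholder, assuming a hypothetical para element whose two words are separated by a line break and indentation. Which kinds of whitespace actually get normalized is not specified here, so the input is only an example.

```xml
<!-- Hypothetical input: a line break plus four spaces between two words -->
<para>Ars
    longa</para>

<!-- Assumed shape after preparation: a single space for counting purposes,
     with the original whitespace preserved in @original -->
<para xmlns:ttt="http://transpect.io/tokenized-to-tree">Ars<ttt:normalized-space
  original="&#xA;    "> </ttt:normalized-space>longa</para>
```

The character reference in @original keeps the line break intact, since a literal line break in an attribute value would be normalized to a space by the XML parser.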

Finally, there is the opposite of ignored content: generated content. Consider a TEI critical apparatus in which the text-critical notes are rendered in a distinct section. Each note is preceded by the page and line(s) that it applies to, plus a few words of context. The context is rendered via the note’s @target attribute. If the task is to retrofit PDF line breaks into the source, the PDF lines must be matched against the notes. This can be done in two ways: by inserting a regex that matches any page/line numbers followed by any context phrase, or by at least including the context phrase in the note element so that the note matches the PDF line (modulo page/line numbers). To do so, the note is temporarily augmented with the context phrase, which must be removed again after the line breaks have been marked up.

For this purpose, there is the following element:

ttt:generated
Mostly useful for line number retrofitting applications. Contains text that is not present at this location in the source XML. It can be a placeholder for a page number or for text from another place in the text, for example from a lemma that is repeated at the beginning of a note.
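
The apparatus scenario could look roughly like the sketch below. The note content, the target value, and the context phrase are invented for illustration. Whether ttt:generated needs further attributes is not stated here.

```xml
<!-- Hypothetical TEI note, temporarily augmented for line matching.
     The ttt:generated element holds the context phrase that the rendered
     apparatus shows in front of the note but that is absent from the source. -->
<note xmlns="http://www.tei-c.org/ns/1.0"
      xmlns:ttt="http://transpect.io/tokenized-to-tree"
      target="#lem-12"><ttt:generated>ars longa] </ttt:generated>om. B</note>
```

After the line breaks have been marked up, the ttt:generated element and its content would be removed again, leaving the original note.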

Side-by-Side Format

Tokenization Markup

The tokens will be marked up with ttt:token and ttt:space elements. These elements cover the whole unexpanded paragraph.
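
A sketch of what such markup might look like for a short paragraph. It assumes that each element simply wraps the characters it covers and that punctuation becomes a token of its own. The actual token boundaries depend on the tokenizer in use.

```xml
<!-- Hypothetical tokenized paragraph: every character sits inside
     either a ttt:token or a ttt:space element -->
<para xmlns:ttt="http://transpect.io/tokenized-to-tree"><ttt:token>Ars</ttt:token><ttt:space> </ttt:space><ttt:token>longa</ttt:token><ttt:token>,</ttt:token><ttt:space> </ttt:space><ttt:token>vita</ttt:token><ttt:space> </ttt:space><ttt:token>brevis</ttt:token><ttt:token>.</ttt:token></para>
```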

Import

<p:import href="http://transpect.io/tokenized-to-tree/xpl/ttt-1-prepare-input.xpl"/>

Dependencies

Synopsis

<ttt:prepare-input xmlns:ttt="http://transpect.io/tokenized-to-tree">
  <p:input port="source" primary="true"/>
  <p:input port="stylesheet"/>
  <p:output port="result" primary="true"/>
  <p:output port="with-ids"/>
  <p:option name="map-higher-unicode-planes" select="'no'"/>
  <p:option name="debug" required="false" select="'no'"/>
  <p:option name="debug-dir-uri" required="false" select="'debug'"/>
</ttt:prepare-input>

ttt:process-paras

Import

<p:import href="http://transpect.io/tokenized-to-tree/xpl/ttt-3-integrate-tokenizer-results.xpl"/>

Dependencies

Synopsis

<ttt:process-paras xmlns:ttt="http://transpect.io/tokenized-to-tree">
  <p:input port="source" primary="true"/>
  <p:input port="patch-token-stylesheet"/>
  <p:input port="rng-schema"/>
  <p:output port="result" primary="true"/>
  <p:option name="debug" required="false" select="'no'"/>
  <p:option name="debug-dir-uri" required="false" select="'debug'"/>
  <p:option name="milestones-only" select="'no'"/>
  <p:option name="map-higher-unicode-planes" select="'no'"/>
</ttt:process-paras>

ttt:merge-results

Import

<p:import href="http://transpect.io/tokenized-to-tree/xpl/ttt-5-expand-placeholders.xpl"/>

Dependencies

Synopsis

<ttt:merge-results xmlns:ttt="http://transpect.io/tokenized-to-tree">
  <p:input port="source" primary="false"/>
  <p:input port="patched-paras" primary="true"/>
  <p:input port="stylesheet"/>
  <p:input port="params"/>
  <p:output port="result" primary="true"/>
  <p:option name="debug" required="false" select="'no'"/>
  <p:option name="debug-dir-uri" required="false" select="'debug'"/>
</ttt:merge-results>
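
The three ttt steps can plausibly be chained as sketched below. This is an illustration based only on the port names in the synopses, not a documented pipeline. All p:document hrefs are hypothetical files, p:identity merely stands in for whatever tokenizer and analysis step is used, and it is assumed that an XML catalog resolves the http://transpect.io/ import URIs.

```xml
<p:declare-step xmlns:p="http://www.w3.org/ns/xproc"
                xmlns:ttt="http://transpect.io/tokenized-to-tree"
                version="1.0" name="ttt-example">
  <p:input port="source" primary="true"/>
  <p:output port="result" primary="true"/>

  <p:import href="http://transpect.io/tokenized-to-tree/xpl/ttt-1-prepare-input.xpl"/>
  <p:import href="http://transpect.io/tokenized-to-tree/xpl/ttt-3-integrate-tokenizer-results.xpl"/>
  <p:import href="http://transpect.io/tokenized-to-tree/xpl/ttt-5-expand-placeholders.xpl"/>

  <!-- Wrap and empty the placeholders; with-ids keeps the full content -->
  <ttt:prepare-input name="prepare">
    <p:input port="stylesheet">
      <p:document href="xsl/prepare-input.xsl"/><!-- hypothetical -->
    </p:input>
  </ttt:prepare-input>

  <!-- Stand-in for the external tokenization and analysis (assumption) -->
  <p:identity name="tokenize-and-analyze"/>

  <ttt:process-paras name="process">
    <p:input port="patch-token-stylesheet">
      <p:document href="xsl/patch-token.xsl"/><!-- hypothetical -->
    </p:input>
    <p:input port="rng-schema">
      <p:document href="schema/tokenizer-results.rng"/><!-- hypothetical -->
    </p:input>
  </ttt:process-paras>

  <!-- patched-paras is primary and connects to the preceding step;
       source is assumed to take the with-ids document -->
  <ttt:merge-results name="merge">
    <p:input port="source">
      <p:pipe step="prepare" port="with-ids"/>
    </p:input>
    <p:input port="stylesheet">
      <p:document href="xsl/merge-results.xsl"/><!-- hypothetical -->
    </p:input>
    <p:input port="params">
      <p:document href="params.xml"/><!-- hypothetical -->
    </p:input>
  </ttt:merge-results>
</p:declare-step>
```

Connecting with-ids to the source port of ttt:merge-results follows from the description of placeholder restoration, but the expected content of the params port is not described here, so that binding is a guess.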

GitHub sync date: 2025-01-08+01:00