tokenized-to-tree
Perform text analysis and patch the results back into the source XML
Git URL | https://github.com/transpect/tokenized-to-tree.git |
SVN URL | https://github.com/transpect/tokenized-to-tree |
Base URI | http://transpect.io/tokenized-to-tree/ |
ttt:line-finder
Import
<p:import href="http://transpect.io/tokenized-to-tree/line-finder-module/xpl/line-finder.xpl"/>
Dependencies
Synopsis
<ttt:line-finder xmlns:ttt="http://transpect.io/tokenized-to-tree">
<p:input port="lines" primary="true"/>
<p:input port="ttt-paras"/>
<p:input port="stylesheet"/>
<p:output port="result" primary="true"/>
<p:option name="ignore-matched-lines" required="false" select="'false'"/>
<p:option name="debug" required="false" select="'no'"/>
<p:option name="debug-dir-uri" required="false" select="'debug'"/>
</ttt:line-finder>
ppp:postprocess-poppler
Import
<p:import href="http://transpect.io/tokenized-to-tree/postprocess-poppler-module/xpl/postprocess-poppler.xpl"/>
Dependencies
Synopsis
<ppp:postprocess-poppler xmlns:ppp="http://transpect.io/postprocess-poppler">
<p:input port="source" primary="true"/>
<p:input port="stylesheet"/>
<p:input port="param-doc"/>
<p:output port="result" primary="true"/>
<p:option name="debug" required="false" select="'no'"/>
<p:option name="debug-dir-uri" required="false" select="'debug'"/>
</ppp:postprocess-poppler>
ttt:prepare-input
Ignorable, Normalized and Generated Content
Content that should be ignored for string counting purposes will receive a role="ttt:placeholder" attribute.
If the content is not an element, it will be wrapped in one of the following special elements:
- ttt:comment: Contains a comment that is invisible to the string-length counting process and is re-inserted after the token markup has been added.
- ttt:ignorable-text: Contains text that is invisible to the string-length counting process and is re-inserted after the token markup has been added.
- ttt:pi: Contains a processing instruction that is invisible to the string-length counting process and is re-inserted after the token markup has been added.
These placeholder elements will then be emptied for the document that is output on the result port.
The full elements will be retained in the document that is output on the with-ids port.
After the result document has been processed (tokenization, analysis, creating token markup and attaching the analysis results to the marked-up tokens), the empty placeholder elements will be replaced with their original content.
For this to work, all placeholder elements need to have IDs. A generated xml:id attribute starting with 'NOID_' will be added if the future placeholder element does not have an ID yet.
The 'NOID_' IDs will be removed during the merging step, ttt:merge-results (file ttt-5-expand-placeholders.xpl).
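The placeholder mechanism described above might look roughly like the following fragments. This is a sketch under assumptions, not output copied from the pipeline: the para and indexterm elements, the processing instruction, the ID it1 and the numeric suffix in NOID_1 are all hypothetical, and the exact serialization of the wrapper elements may differ.

```xml
<!-- Hypothetical source paragraph with an ignorable element and a processing instruction -->
<para>A short<indexterm xml:id="it1">sentences</indexterm><?pagebreak?> sentence.</para>

<!-- Assumed shape on the with-ids port: full placeholders, IDs added where missing -->
<para xmlns:ttt="http://transpect.io/tokenized-to-tree">A short<indexterm
  role="ttt:placeholder" xml:id="it1">sentences</indexterm><ttt:pi
  xml:id="NOID_1"><?pagebreak?></ttt:pi> sentence.</para>

<!-- Assumed shape on the result port: same placeholders, emptied -->
<para xmlns:ttt="http://transpect.io/tokenized-to-tree">A short<indexterm
  role="ttt:placeholder" xml:id="it1"/><ttt:pi xml:id="NOID_1"/> sentence.</para>
```

In this sketch the string that is counted and tokenized is "A short sentence."; the IDs are what allow the emptied elements to be matched with their full counterparts during merging.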
There is an additional type of ttt:placeholder element:
- ttt:normalized-space: An element that contains exactly one space character, plus an @original attribute. For counting and analysis purposes, it will look like an ordinary single space character. The original space-like content will be restored from the @original attribute during the merging phase.
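As an illustration, a normalized em space could be represented as shown below. This is a minimal sketch: the surrounding para element, the choice of U+2003 as the original character and the NOID_2 ID are assumptions, as is the unprefixed form of the original attribute.

```xml
<!-- Hypothetical example: an em space (U+2003) in the source is presented
     to the tokenizer as a single ordinary space -->
<para xmlns:ttt="http://transpect.io/tokenized-to-tree">Chapter<ttt:normalized-space
  original="&#x2003;" xml:id="NOID_2"> </ttt:normalized-space>One</para>
```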
Finally, there is the opposite of ignored content: generated content.
Scenario: In a TEI critical apparatus, the text-critical notes are rendered in a distinct section. Each note is preceded by the page and line(s) that it applies to, plus some words of context. The context will be rendered via the note’s @target attribute. If the task is to retrofit PDF line breaks into the source, the PDF lines must be matched against the notes. This can be done either by inserting a regex that would match any page/line numbers, followed by any context phrase, or by at least including the context phrase in the note element so that it will match the PDF line (modulo page/line numbers). In the latter case, we need to temporarily augment the note with the context phrase, and we need to make sure that it will be removed after the line breaks have been marked up.
For this purpose, there is a dedicated element:
- ttt:generated: Mostly useful for line number retrofitting applications. Contains text that is not present at this location in the source XML. It can be a placeholder for a page number or for text from another place in the text, for example from a lemma that is repeated at the beginning of a note.
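The apparatus scenario could be sketched as follows. The note content, the lemma text and the #lem1 target are invented for illustration, and the placement of ttt:generated at the start of the note is an assumption based on the description above.

```xml
<!-- Hypothetical TEI note as it appears in the source -->
<note xmlns="http://www.tei-c.org/ns/1.0" target="#lem1">om. B</note>

<!-- Assumed temporary form: the lemma text that the rendering repeats
     before the note is added as generated content, so that the PDF line
     can be matched; it is removed again after the line breaks are marked up -->
<note xmlns="http://www.tei-c.org/ns/1.0"
  xmlns:ttt="http://transpect.io/tokenized-to-tree"
  target="#lem1"><ttt:generated>in principio] </ttt:generated>om. B</note>
```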
Side-by-Side Format
Tokenization Markup
The tokens will be marked up with ttt:token and ttt:space elements. These elements cover the whole unexpanded paragraph.
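For a paragraph whose unexpanded text is "A short sentence.", the token markup might look like the sketch below. The token boundaries (in particular the separate punctuation token) depend on the tokenizer in use and are assumed here; the wrapping para element is hypothetical.

```xml
<!-- Hypothetical tokenization of the string "A short sentence." -->
<para xmlns:ttt="http://transpect.io/tokenized-to-tree">
  <ttt:token>A</ttt:token><ttt:space> </ttt:space>
  <ttt:token>short</ttt:token><ttt:space> </ttt:space>
  <ttt:token>sentence</ttt:token><ttt:token>.</ttt:token>
</para>
```

The line breaks and indentation between the elements are for readability only; in real output such whitespace would add to the character count.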
Import
<p:import href="http://transpect.io/tokenized-to-tree/xpl/ttt-1-prepare-input.xpl"/>
Dependencies
Synopsis
<ttt:prepare-input xmlns:ttt="http://transpect.io/tokenized-to-tree">
<p:input port="source" primary="true"/>
<p:input port="stylesheet"/>
<p:output port="result" primary="true"/>
<p:output port="with-ids"/>
<p:option name="map-higher-unicode-planes" select="'no'"/>
<p:option name="debug" required="false" select="'no'"/>
<p:option name="debug-dir-uri" required="false" select="'debug'"/>
</ttt:prepare-input>
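A minimal sketch of calling this step from an XProc 1.0 pipeline is shown below. The wrapper step type ex:prepare-example, its namespace and the stylesheet file name my-prepare-input.xsl are hypothetical; the sketch also assumes that an XML catalog resolves the http://transpect.io/ import URI to a local checkout.

```xml
<p:declare-step xmlns:p="http://www.w3.org/ns/xproc"
  xmlns:ttt="http://transpect.io/tokenized-to-tree"
  xmlns:ex="http://example.org/ns/pipelines"
  type="ex:prepare-example" name="prepare-example" version="1.0">

  <p:input port="source" primary="true"/>
  <!-- result: document with emptied placeholders, ready for tokenization -->
  <p:output port="result" primary="true"/>
  <!-- with-ids: document with full placeholders, kept for the merging step -->
  <p:output port="with-ids">
    <p:pipe port="with-ids" step="prepare"/>
  </p:output>

  <p:import href="http://transpect.io/tokenized-to-tree/xpl/ttt-1-prepare-input.xpl"/>

  <ttt:prepare-input name="prepare">
    <p:input port="stylesheet">
      <!-- hypothetical project-specific stylesheet -->
      <p:document href="my-prepare-input.xsl"/>
    </p:input>
    <p:with-option name="debug" select="'yes'"/>
    <p:with-option name="debug-dir-uri" select="'debug'"/>
  </ttt:prepare-input>

</p:declare-step>
```

The primary source input and the primary result output are connected implicitly, so only the non-primary with-ids output needs an explicit p:pipe binding.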
ttt:process-paras
Import
<p:import href="http://transpect.io/tokenized-to-tree/xpl/ttt-3-integrate-tokenizer-results.xpl"/>
Dependencies
Synopsis
<ttt:process-paras xmlns:ttt="http://transpect.io/tokenized-to-tree">
<p:input port="source" primary="true"/>
<p:input port="patch-token-stylesheet"/>
<p:input port="rng-schema"/>
<p:output port="result" primary="true"/>
<p:option name="debug" required="false" select="'no'"/>
<p:option name="debug-dir-uri" required="false" select="'debug'"/>
<p:option name="milestones-only" select="'no'"/>
<p:option name="map-higher-unicode-planes" select="'no'"/>
</ttt:process-paras>
ttt:merge-results
Import
<p:import href="http://transpect.io/tokenized-to-tree/xpl/ttt-5-expand-placeholders.xpl"/>
Dependencies
Synopsis
<ttt:merge-results xmlns:ttt="http://transpect.io/tokenized-to-tree">
<p:input port="source" primary="false"/>
<p:input port="patched-paras" primary="true"/>
<p:input port="stylesheet"/>
<p:input port="params"/>
<p:output port="result" primary="true"/>
<p:option name="debug" required="false" select="'no'"/>
<p:option name="debug-dir-uri" required="false" select="'debug'"/>
</ttt:merge-results>
GitHub sync date: 2025-01-08+01:00