cedict-tagger

Tag Chinese text with Pinyin, Bopomofo, and English translations using CEDict. A simple trie-based matching algorithm is used for tagging.

Using

Tagger Plugins add annotations to text that can be queried and composed later.

Blockifiers convert data into Steamship’s native Block format.

Importer Plugins bring data from external sources into Steamship.

Use them when writing Packages to help you work with data of different types.


from steamship import Steamship, File

client = Steamship(workspace="my-workspace-handle")

# Import a file to Steamship
with open("file.ext") as f:
  file = File.create(client, content=f.read())

# Create a blockifier. We'll assume Markdown here.
blockifier = client.use_plugin(
  'markdown-blockifier-default'
)

# Blockify the file and wait for it to finish before tagging
task = file.blockify(plugin_instance=blockifier.handle)
task.wait()

# Create an instance of this tagger.
tagger = client.use_plugin(
  'cedict-tagger'
)

# Tag the file with the tagger instance
task = file.tag(plugin_instance=tagger.handle)
task.wait()
Pulled from the GitHub repository.
# CEDict Tagger Plugin - Steamship

This project contains a Steamship Tagger plugin that applies CEDict to a span of Mandarin text.

## Output

Each output tag has:

* `kind` - `token`
* `name` - `ce-dict`
* `value` - A CE-Dict object, as defined below

The CE-Dict object carries the following fields for the tagged text:

* `en` - English translation
* `trad` - Traditional characters
* `simp` - Simplified characters
* `pynum` - Pinyin (numeric style)
* `pyacc` - Pinyin (accent style)
* `zhuyin` - Zhuyin (bopomofo)
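
As a rough illustration of the layout described above, the sketch below builds one tag for the word 你好. The `CEDictValue` class and `make_tag` helper are hypothetical names used only for this example, and the field values (including the assumption that `en` is a single string) are illustrative rather than copied from the plugin's actual output.

```python
from dataclasses import asdict, dataclass


@dataclass
class CEDictValue:
    """Hypothetical container mirroring the CE-Dict fields listed above."""
    en: str      # English translation (assumed here to be a plain string)
    trad: str    # Traditional characters
    simp: str    # Simplified characters
    pynum: str   # Pinyin, numeric style
    pyacc: str   # Pinyin, accent style
    zhuyin: str  # Zhuyin (bopomofo)


def make_tag(value: CEDictValue) -> dict:
    """Wrap a CE-Dict value in the kind/name/value layout from the Output section."""
    return {"kind": "token", "name": "ce-dict", "value": asdict(value)}


# Illustrative values for 你好; the plugin's real strings may differ.
example_tag = make_tag(
    CEDictValue(
        en="hello",
        trad="你好",
        simp="你好",
        pynum="ni3 hao3",
        pyacc="nǐ hǎo",
        zhuyin="ㄋㄧˇ ㄏㄠˇ",
    )
)
```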

## Algorithm

The tagger makes a best effort to match:

* The first entry in CEDict among competing alternatives
* The longest contiguous chunk of text, in greedy-search fashion

This approach falls short where translation of meaning is concerned,
but should fare well where word lookup and tokenization are concerned.
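
The two matching rules above can be sketched with a small character trie. This is a minimal illustration under stated assumptions, not the plugin's actual code: `TrieNode`, `build_trie`, `greedy_match`, and the toy entries are all hypothetical, and a real run would load entries from the CEDict data instead.

```python
from typing import Dict, List, Optional, Tuple


class TrieNode:
    """One node of a character trie; `entry` is set where a dictionary word ends."""

    def __init__(self) -> None:
        self.children: Dict[str, "TrieNode"] = {}
        self.entry: Optional[dict] = None


def build_trie(entries: List[Tuple[str, dict]]) -> TrieNode:
    """Build a trie from (word, entry) pairs, keeping the first entry per word."""
    root = TrieNode()
    for word, entry in entries:
        node = root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        if node.entry is None:  # first entry wins among competing alternatives
            node.entry = entry
    return root


def greedy_match(text: str, root: TrieNode) -> List[Tuple[int, int, dict]]:
    """Return (start, end, entry) spans, taking the longest match at each position."""
    spans = []
    i = 0
    while i < len(text):
        node = root
        best_end, best_entry = None, None
        j = i
        # Walk the trie as far as the text allows, remembering the last full word.
        while j < len(text) and text[j] in node.children:
            node = node.children[text[j]]
            j += 1
            if node.entry is not None:
                best_end, best_entry = j, node.entry
        if best_entry is None:
            i += 1  # no dictionary word starts here; skip one character
        else:
            spans.append((i, best_end, best_entry))
            i = best_end  # greedy: continue right after the matched word
    return spans


# Toy dictionary (hypothetical entries standing in for CEDict data).
TOY_ENTRIES = [
    ("中", {"en": "middle"}),
    ("中国", {"en": "China"}),
    ("中国", {"en": "(later duplicate, ignored)"}),
    ("人", {"en": "person"}),
]

spans = greedy_match("中国人!", build_trie(TOY_ENTRIES))
```

In this toy run, 中国 is preferred over the shorter 中 because the walk keeps the longest word found from each start position, and the punctuation mark is skipped because no entry begins with it.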


