wikipedia-blockifier

Convert Wikipedia HTML into clean Markdown-tagged text. Import a Wikipedia page's HTML as raw bytes, then apply this Blockifier.

Using

Tagger Plugins add annotations to text that can be queried and composed later.

Blockifiers convert data into Steamship’s native Block format.

Importer Plugins bring data from external sources into Steamship.

Use them when writing Packages to help you work with data of different types.


```python
from steamship import Steamship, File

client = Steamship(workspace="my-workspace-handle")

# Import a file to Steamship (read as raw bytes)
with open("file.ext", "rb") as f:
    file = File.create(client, content=f.read())

# Create an instance of this blockifier
blockifier = client.use_plugin("wikipedia-blockifier")

# Blockify the file with that plugin instance, then wait for the task to finish
task = file.blockify(plugin_instance=blockifier.handle)
task.wait()
```
Pulled from the GitHub repository.
# Wikipedia Blockifier (Steamship Plugin)

This Steamship Plugin converts Wikipedia HTML into Steamship Block format.
Import a Wikipedia page into Steamship and then apply this Blockifier to produce tagged text that can be queried and passed to other models.

## Usage

This plugin is auto-deployed to Steamship and available via the handle `wikipedia-blockifier`.

Here is a complete example of using it:

```python
import requests

from steamship import Steamship, File, MimeTypes
from steamship.data.tags import Tag, DocTag

# Create your ~/steamship.json credentials by:
# 1) Installing the Steamship CLI: `npm install -g @steamship/cli`
# 2) Logging in: `ship login`
client = Steamship()

# Add a Wikipedia page to Steamship
html = requests.get("https://en.wikipedia.org/wiki/Honey_badger")
file = File.create(client, content=html.content, mime_type=MimeTypes.HTML)

# Blockify it using this plugin. That converts it into Steamship Block format.
blockifier = client.use_plugin("wikipedia-blockifier", "my-wikipedia-blockifier")
blockify_task = file.blockify(plugin_instance=blockifier.handle)
blockify_task.wait()

# Refresh the file to pull down new data
file = file.refresh()

# Query for all H2 Elements
query_tags = Tag.query(
    client, f'kind "{DocTag.DOCUMENT}" and name "{DocTag.H2}"'
).tags

tags = [tag for tag in query_tags if tag.file_id == file.id]
print(f"There are {len(tags)} {DocTag.H2} elements in that page:")
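# file.blocks is a list, so index the blocks by id before looking up each tag's block
blocks_by_id = {block.id: block for block in file.blocks}
for tag in tags:
    print(blocks_by_id[tag.block_id].text[tag.start_idx:tag.end_idx])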
```

## Blockifier Strategy

This Blockifier uses knowledge of Wikipedia's HTML structure to extract content in Steamship Block format. It includes
the title and the body of the page. Each major paragraph or list in the body is returned as a new block. Within
those blocks, each link, bold/italic/underline span, and sub-list is tagged, along with the broader block type.

Whereas the input HTML elements are **nested**, the output Steamship Block structure is **plain text**, with **overlapping tags atop it**. 
The nested structure of the HTML could be recovered, if one wished, by reasoning about the way in which tags overlap.
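
To make the idea of plain text with overlapping tags concrete, here is a minimal sketch that flattens a small HTML fragment into text plus character-offset spans. It uses only Python's standard `html.parser` module. The `Span` and `Flattener` names are hypothetical and for illustration only; this is not the plugin's actual implementation, which also applies Wikipedia-specific rules.

```python
from dataclasses import dataclass
from html.parser import HTMLParser


@dataclass
class Span:
    """Hypothetical stand-in for a tag: a name plus character offsets."""
    name: str
    start_idx: int
    end_idx: int


class Flattener(HTMLParser):
    """Collects plain text and records one Span per closed element."""

    def __init__(self):
        super().__init__()
        self.text = ""
        self.spans = []
        self._open = []  # stack of (element name, start offset)

    def handle_starttag(self, tag, attrs):
        # Remember where this element's text begins.
        self._open.append((tag, len(self.text)))

    def handle_endtag(self, tag):
        # Close the innermost open element with a matching name, if any.
        # Unclosed elements opened inside it (such as <br>) are discarded.
        for i in range(len(self._open) - 1, -1, -1):
            name, start = self._open[i]
            if name == tag:
                del self._open[i:]
                self.spans.append(Span(name, start, len(self.text)))
                break

    def handle_data(self, data):
        self.text += data


def flatten(html):
    """Return (plain_text, spans) for an HTML fragment."""
    parser = Flattener()
    parser.feed(html)
    parser.close()
    return parser.text, parser.spans


text, spans = flatten("<p>The <b>honey <i>badger</i></b> digs.</p>")
for span in spans:
    print(span.name, repr(text[span.start_idx:span.end_idx]))
# i 'badger'
# b 'honey badger'
# p 'The honey badger digs.'
```

The three spans overlap on the same characters of one flat string, which mirrors how the `start_idx` and `end_idx` of a tag address a slice of a block's text in the usage example.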

The following diagram illustrates this approach to tagging:

![Diagram of Tagging Strategy](https://github.com/steamship-plugins/wikipedia-blockifier/blob/main/doc/tagging-strategy.png)
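
The recovery of nesting from overlapping tags, mentioned above, can be sketched as follows. This is an illustrative approach, not part of the plugin: the `Node`, `build_tree`, and `outline` names are hypothetical, spans are given as plain `(name, start_idx, end_idx)` tuples, and the sketch assumes every pair of spans is either nested or disjoint, as they would be when derived from well-formed HTML.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical tree node built from one span tag."""
    name: str
    start_idx: int
    end_idx: int
    children: list = field(default_factory=list)


def build_tree(spans):
    """Nest (name, start_idx, end_idx) spans by containment."""
    # Visit outer spans first: earlier start, then larger end.
    ordered = sorted(spans, key=lambda s: (s[1], -s[2]))
    roots, stack = [], []
    for name, start, end in ordered:
        node = Node(name, start, end)
        # Pop ancestors that do not fully contain this span.
        while stack and not (
            stack[-1].start_idx <= start and end <= stack[-1].end_idx
        ):
            stack.pop()
        (stack[-1].children if stack else roots).append(node)
        stack.append(node)
    return roots


def outline(nodes, text, depth=0):
    """Render the tree as indented lines showing each span's text."""
    lines = []
    for node in nodes:
        snippet = text[node.start_idx:node.end_idx]
        lines.append("  " * depth + f"{node.name}: {snippet!r}")
        lines.extend(outline(node.children, text, depth + 1))
    return lines


text = "The honey badger digs."
spans = [("i", 10, 16), ("b", 4, 16), ("p", 0, 22)]
roots = build_tree(spans)
print("\n".join(outline(roots, text)))
# p: 'The honey badger digs.'
#   b: 'honey badger'
#     i: 'badger'
```

One limitation of this sketch: when two spans cover exactly the same characters, offsets alone cannot say which element was the outer one, so the result follows the input order.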
