Allow whitespace-only pieces #984

Open

Description

From what I understand, the allow_whitespace_only_pieces training argument, implemented in the word-level pretokeniser at this line, allows multiple spaces to appear next to each other in the strings that result from the pretokeniser (let's call them "pre-tokens"). Because the trainer gets its substrings from inside pre-tokens, having multiple spaces in one pre-token allows it to learn tokens consisting of more than one space.

I have two questions:

  1. Is this not a confusing way to name this option? When allow_whitespace_only_pieces is false, it produces pre-tokens that consist of whitespace only, which is completely counterintuitive. (It also means that there will be at least one token allowed that is whitespace-only.)
  2. For my application, what I need is what you would actually expect an option named "allow whitespace-only pieces" to do: produce pre-tokens that contain only whitespace, and never mix whitespace with non-whitespace within a token. Is this straightforward to do by setting training options, or does it need extra implementation?

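To illustrate all of this with an example: the sentence "This is a    test sentence." (with four spaces between "a" and "test") is split as follows in the three cases outlined above, where # stands for a whitespace character and pre-tokens are separated by single spaces: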

  • allow_whitespace_only_pieces = false: This #is #a # # # #test #sentence. (seemingly allows pieces that are whitespace-only)
  • allow_whitespace_only_pieces = true: This #is #a ####test #sentence.
  • What I need: This # is # a #### test # sentence.
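
To make the three splits above concrete, here is a minimal sketch in plain Python that reproduces them with regular expressions. It only emulates the behaviour as I described it above and is not the library's implementation; the mode names and the pre_tokenise helper are made up for this illustration, and # is again used as the whitespace marker.

```python
import re

# Hypothetical mode names: labels for this sketch only, not library options.
PATTERNS = {
    # At most one leading space per pre-token; extra spaces stand alone.
    "single_space_prefix": r" ?[^ ]+| ",
    # A whole run of spaces is glued to the word that follows it.
    "space_run_prefix": r" *[^ ]+| +",
    # Runs of spaces and runs of non-spaces are never mixed.
    "whitespace_isolated": r" +|[^ ]+",
}

def pre_tokenise(text, mode, marker="#"):
    """Split text into pre-tokens and show each space as the marker."""
    pieces = re.findall(PATTERNS[mode], text)
    return [piece.replace(" ", marker) for piece in pieces]

sentence = "This is a    test sentence."  # four spaces between "a" and "test"
for mode in PATTERNS:
    print(f"{mode}: {' '.join(pre_tokenise(sentence, mode))}")
```

The first two modes correspond to the two splits I observe with the option set to false and true respectively, and the third is the split I am asking for.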

Thanks.
