package tokenize

import "github.com/jdkato/prose/tokenize"

Package tokenize implements functions to split strings into slices of substrings.

Index

Examples

Functions

func TextToWords

func TextToWords(text string) []string

TextToWords converts the string text into a slice of words.

It does so by tokenizing text into sentences (using a port of NLTK's punkt tokenizer; see https://github.com/neurosnap/sentences) and then tokenizing the sentences into words via TreebankWordTokenizer.
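
A minimal usage sketch in the style of the examples below (not one of the package's published examples; it assumes "fmt" is imported, and the output shown in the comment is illustrative rather than verified):

{
	// Raw text is first split into sentences, then each sentence into words.
	words := TextToWords("Go is expressive. It's also concise.")
	fmt.Println(words)
	// Illustrative output: [Go is expressive . It 's also concise .]
}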

Types

type PragmaticSegmenter

type PragmaticSegmenter struct {
	// contains filtered or unexported fields
}

PragmaticSegmenter is a multilingual, rule-based sentence boundary detector.

This is a port of the Ruby library by Kevin S. Dias (https://github.com/diasks2/pragmatic_segmenter).

func NewPragmaticSegmenter

func NewPragmaticSegmenter(lang string) (*PragmaticSegmenter, error)

NewPragmaticSegmenter creates a new PragmaticSegmenter according to the specified language. If the given language is not supported, an error will be returned.

Languages are specified by their two-character ISO 639-1 code. The supported languages are "en" (English), "es" (Spanish), "fr" (French) ... (WIP)

func (*PragmaticSegmenter) Tokenize

func (p *PragmaticSegmenter) Tokenize(text string) []string

Tokenize splits text into sentences.
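
A minimal sketch of typical use (not one of the package's published examples; it assumes "fmt" and "log" are imported, and the sentence boundaries shown in the comment are illustrative):

{
	// Create an English segmenter; unsupported languages return an error.
	seg, err := NewPragmaticSegmenter("en")
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(seg.Tokenize("Hello, Dr. Smith. How are you today?"))
	// Illustrative output (an abbreviation such as "Dr." does not end a sentence):
	// [Hello, Dr. Smith. How are you today?]
}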

type ProseTokenizer

type ProseTokenizer interface {
	Tokenize(text string) []string
}

ProseTokenizer is the interface implemented by an object that takes a string and returns a slice of substrings.
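
Because each tokenizer in this package exposes a Tokenize(string) []string method, they can be used interchangeably behind this interface. A sketch (assumes "fmt" is imported):

{
	var t ProseTokenizer

	// Word-level tokenization.
	t = NewTreebankWordTokenizer()
	fmt.Println(t.Tokenize("They'll save and invest more."))

	// Sentence-level tokenization through the same interface.
	t = NewPunktSentenceTokenizer()
	fmt.Println(t.Tokenize("First sentence. Second sentence."))
}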

type PunktSentenceTokenizer

type PunktSentenceTokenizer struct {
	// contains filtered or unexported fields
}

PunktSentenceTokenizer is an extension of the Go implementation of the Punkt sentence tokenizer (https://github.com/neurosnap/sentences), with a few minor improvements (see https://github.com/neurosnap/sentences/pull/18).

func NewPunktSentenceTokenizer

func NewPunktSentenceTokenizer() *PunktSentenceTokenizer

NewPunktSentenceTokenizer creates a new PunktSentenceTokenizer and loads its English model.

func (PunktSentenceTokenizer) Tokenize

func (p PunktSentenceTokenizer) Tokenize(text string) []string

Tokenize splits text into sentences.
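
A minimal sketch (not one of the package's published examples; it assumes "fmt" is imported, and the exact whitespace in the output comment is illustrative):

{
	t := NewPunktSentenceTokenizer()
	fmt.Println(t.Tokenize("This is sentence one. This is sentence two."))
	// Illustrative output: [This is sentence one. This is sentence two.]
}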

type RegexpTokenizer

type RegexpTokenizer struct {
	// contains filtered or unexported fields
}

RegexpTokenizer splits a string into substrings using a regular expression.

func NewBlanklineTokenizer

func NewBlanklineTokenizer() *RegexpTokenizer

NewBlanklineTokenizer is a RegexpTokenizer constructor.

This tokenizer splits on any sequence of blank lines.

Example

Code:

{
	t := NewBlanklineTokenizer()
	fmt.Println(t.Tokenize("They'll save and invest more.\n\nThanks!"))
	// Output: [They'll save and invest more. Thanks!]
}

Output:

[They'll save and invest more. Thanks!]

func NewRegexpTokenizer

func NewRegexpTokenizer(pattern string, gaps, discard bool) *RegexpTokenizer

NewRegexpTokenizer is a RegexpTokenizer constructor that takes three arguments: a pattern to base the tokenizer on, a boolean value (gaps) indicating whether the pattern matches the separators between tokens, and a boolean value (discard) indicating whether to discard empty tokens.
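
A sketch of constructing a custom tokenizer, under the assumption that gaps=true means the pattern matches the separators between tokens and discard=true drops empty tokens (assumes "fmt" is imported; the output comment is illustrative):

{
	// Split on commas followed by optional whitespace.
	t := NewRegexpTokenizer(`,\s*`, true, true)
	fmt.Println(t.Tokenize("red, green, blue"))
	// Illustrative output, under the assumptions above: [red green blue]
}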

func NewWordBoundaryTokenizer

func NewWordBoundaryTokenizer() *RegexpTokenizer

NewWordBoundaryTokenizer is a RegexpTokenizer constructor.

This tokenizer splits text into a sequence of word-like tokens.

Example

Code:

{
	t := NewWordBoundaryTokenizer()
	fmt.Println(t.Tokenize("They'll save and invest more."))
	// Output: [They'll save and invest more]
}

Output:

[They'll save and invest more]

func NewWordPunctTokenizer

func NewWordPunctTokenizer() *RegexpTokenizer

NewWordPunctTokenizer is a RegexpTokenizer constructor.

This tokenizer splits text into a sequence of alphabetic and non-alphabetic characters.

Example

Code:

{
	t := NewWordPunctTokenizer()
	fmt.Println(t.Tokenize("They'll save and invest more."))
	// Output: [They ' ll save and invest more .]
}

Output:

[They ' ll save and invest more .]

func (RegexpTokenizer) Tokenize

func (r RegexpTokenizer) Tokenize(text string) []string

Tokenize splits text into a slice of tokens according to its regexp pattern.

type TreebankWordTokenizer

type TreebankWordTokenizer struct {
}

TreebankWordTokenizer splits a sentence into words.

This implementation is a port of the Sed script written by Robert McIntyre, which is available at https://gist.github.com/jdkato/fc8b8c4266dba22d45ac85042ae53b1e.

func NewTreebankWordTokenizer

func NewTreebankWordTokenizer() *TreebankWordTokenizer

NewTreebankWordTokenizer is a TreebankWordTokenizer constructor.

Example

Code:

{
	t := NewTreebankWordTokenizer()
	fmt.Println(t.Tokenize("They'll save and invest more."))
	// Output: [They 'll save and invest more .]
}

Output:

[They 'll save and invest more .]

func (TreebankWordTokenizer) Tokenize

func (t TreebankWordTokenizer) Tokenize(text string) []string

Tokenize splits a sentence into a slice of words.

This tokenizer performs the following steps: (1) split on contractions (e.g., "don't" -> [do n't]), (2) split on non-terminating punctuation, (3) split on single quotes when followed by whitespace, and (4) split on periods that appear at the end of lines.

NOTE: As mentioned above, this function expects a sentence (not raw text) as input.
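
A sketch illustrating steps (1) and (4) on a single sentence (not one of the package's published examples; it assumes "fmt" is imported, and the token boundaries shown are illustrative):

{
	t := NewTreebankWordTokenizer()
	fmt.Println(t.Tokenize("They don't think it's over."))
	// Illustrative output: [They do n't think it 's over .]
}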

Source Files

pragmatic.go punkt.go regexp.go tokenize.go treebank.go

Version
v1.2.1 (latest)
Published
Dec 22, 2020