package tokenize

import "github.com/jdkato/prose/tokenize"

Package tokenize implements functions to split strings into slices of substrings.

Index

Examples

Functions

func TextToWords

func TextToWords(text string) []string

TextToWords converts the string text into a slice of words.

It does so by tokenizing text into sentences (using a port of NLTK's punkt tokenizer; see https://github.com/neurosnap/sentences) and then tokenizing the sentences into words via TreebankWordTokenizer.
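
A minimal usage sketch in the style of the examples below (not one of the package's published examples; it assumes "fmt" is imported, and the output shown in the comment is illustrative rather than verified):

{
	// Raw text is first split into sentences, then each sentence into words.
	words := TextToWords("Go is expressive. It's also concise.")
	fmt.Println(words)
	// Illustrative output: [Go is expressive . It 's also concise .]
}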

Types

type PragmaticSegmenter

type PragmaticSegmenter struct {
	// contains filtered or unexported fields
}

PragmaticSegmenter is a multilingual, rule-based sentence boundary detector.

This is a port of the Ruby library by Kevin S. Dias (https://github.com/diasks2/pragmatic_segmenter).

func NewPragmaticSegmenter

func NewPragmaticSegmenter(lang string) (*PragmaticSegmenter, error)

NewPragmaticSegmenter creates a new PragmaticSegmenter according to the specified language. If the given language is not supported, an error will be returned.

Languages are specified by their two-character ISO 639-1 code. The supported languages are "en" (English), "es" (Spanish), "fr" (French) ... (WIP)

func (*PragmaticSegmenter) Tokenize

func (p *PragmaticSegmenter) Tokenize(text string) []string

Tokenize splits text into sentences.
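
A minimal sketch of typical use (not one of the package's published examples; it assumes "fmt" and "log" are imported, and the sentence boundaries shown in the comment are illustrative):

{
	// Create an English segmenter; unsupported languages return an error.
	seg, err := NewPragmaticSegmenter("en")
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(seg.Tokenize("Hello, Dr. Smith. How are you today?"))
	// Illustrative output (an abbreviation such as "Dr." does not end a sentence):
	// [Hello, Dr. Smith. How are you today?]
}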

type ProseTokenizer

type ProseTokenizer interface {
	Tokenize(text string) []string
}

ProseTokenizer is the interface implemented by an object that takes a string and returns a slice of substrings.
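
Because each tokenizer in this package exposes a Tokenize(string) []string method, they can be used interchangeably behind this interface. A sketch (assumes "fmt" is imported):

{
	var t ProseTokenizer

	// Word-level tokenization.
	t = NewTreebankWordTokenizer()
	fmt.Println(t.Tokenize("They'll save and invest more."))

	// Sentence-level tokenization through the same interface.
	t = NewPunktSentenceTokenizer()
	fmt.Println(t.Tokenize("First sentence. Second sentence."))
}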

type PunktSentenceTokenizer

type PunktSentenceTokenizer struct {
	// contains filtered or unexported fields
}

PunktSentenceTokenizer is an extension of the Go implementation of the Punkt sentence tokenizer (https://github.com/neurosnap/sentences), with a few minor improvements (see https://github.com/neurosnap/sentences/pull/18).

func NewPunktSentenceTokenizer

func NewPunktSentenceTokenizer() *PunktSentenceTokenizer

NewPunktSentenceTokenizer creates a new PunktSentenceTokenizer and loads its English model.

func (PunktSentenceTokenizer) Tokenize

func (p PunktSentenceTokenizer) Tokenize(text string) []string

Tokenize splits text into sentences.
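
A minimal sketch (not one of the package's published examples; it assumes "fmt" is imported, and the exact whitespace in the output comment is illustrative):

{
	t := NewPunktSentenceTokenizer()
	fmt.Println(t.Tokenize("This is sentence one. This is sentence two."))
	// Illustrative output: [This is sentence one. This is sentence two.]
}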

type RegexpTokenizer

type RegexpTokenizer struct {
	// contains filtered or unexported fields
}

RegexpTokenizer splits a string into substrings using a regular expression.

func NewBlanklineTokenizer

func NewBlanklineTokenizer() *RegexpTokenizer

NewBlanklineTokenizer is a RegexpTokenizer constructor.

This tokenizer splits on any sequence of blank lines.

Example

Code:

{
	t := NewBlanklineTokenizer()
	fmt.Println(t.Tokenize("They'll save and invest more.\n\nThanks!"))
	// Output: [They'll save and invest more. Thanks!]
}

Output:

[They'll save and invest more. Thanks!]

func NewRegexpTokenizer

func NewRegexpTokenizer(pattern string, gaps, discard bool) *RegexpTokenizer

NewRegexpTokenizer is a RegexpTokenizer constructor that takes three arguments: a pattern to base the tokenizer on, a boolean value (gaps) indicating whether the pattern matches the separators between tokens, and a boolean value (discard) indicating whether to discard empty tokens.
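
A sketch of constructing a custom tokenizer, under the assumption that gaps=true means the pattern matches the separators between tokens and discard=true drops empty tokens (assumes "fmt" is imported; the output comment is illustrative):

{
	// Split on commas followed by optional whitespace.
	t := NewRegexpTokenizer(`,\s*`, true, true)
	fmt.Println(t.Tokenize("red, green, blue"))
	// Illustrative output, under the assumptions above: [red green blue]
}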

func NewWordBoundaryTokenizer

func NewWordBoundaryTokenizer() *RegexpTokenizer

NewWordBoundaryTokenizer is a RegexpTokenizer constructor.

This tokenizer splits text into a sequence of word-like tokens.

Example

Code:

{
	t := NewWordBoundaryTokenizer()
	fmt.Println(t.Tokenize("They'll save and invest more."))
	// Output: [They'll save and invest more]
}

Output:

[They'll save and invest more]

func NewWordPunctTokenizer

func NewWordPunctTokenizer() *RegexpTokenizer

NewWordPunctTokenizer is a RegexpTokenizer constructor.

This tokenizer splits text into a sequence of alphabetic and non-alphabetic characters.

Example

Code:

{
	t := NewWordPunctTokenizer()
	fmt.Println(t.Tokenize("They'll save and invest more."))
	// Output: [They ' ll save and invest more .]
}

Output:

[They ' ll save and invest more .]

func (RegexpTokenizer) Tokenize

func (r RegexpTokenizer) Tokenize(text string) []string

Tokenize splits text into a slice of tokens according to its regexp pattern.

type TreebankWordTokenizer

type TreebankWordTokenizer struct {
}

TreebankWordTokenizer splits a sentence into words.

This implementation is a port of the Sed script written by Robert McIntyre, which is available at https://gist.github.com/jdkato/fc8b8c4266dba22d45ac85042ae53b1e.

func NewTreebankWordTokenizer

func NewTreebankWordTokenizer() *TreebankWordTokenizer

NewTreebankWordTokenizer is a TreebankWordTokenizer constructor.

Example

Code:

{
	t := NewTreebankWordTokenizer()
	fmt.Println(t.Tokenize("They'll save and invest more."))
	// Output: [They 'll save and invest more .]
}

Output:

[They 'll save and invest more .]

func (TreebankWordTokenizer) Tokenize

func (t TreebankWordTokenizer) Tokenize(text string) []string

Tokenize splits a sentence into a slice of words.

This tokenizer performs the following steps: (1) split on contractions (e.g., "don't" -> [do n't]), (2) split on non-terminating punctuation, (3) split on single quotes when followed by whitespace, and (4) split on periods that appear at the end of lines.

NOTE: As mentioned above, this function expects a sentence (not raw text) as input.
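
A sketch illustrating steps (1) and (4) on a single sentence (not one of the package's published examples; it assumes "fmt" is imported, and the token boundaries shown are illustrative):

{
	t := NewTreebankWordTokenizer()
	fmt.Println(t.Tokenize("They don't think it's over."))
	// Illustrative output: [They do n't think it 's over .]
}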

Source Files

pragmatic.go punkt.go regexp.go tokenize.go treebank.go

Version
v1.2.1 (latest)
Published
Dec 22, 2020