package sentences
import "gopkg.in/neurosnap/sentences.v1"
Package sentences is a Go package that converts a blob of text into a list of sentences.
This package attempts to support a multitude of languages: Czech, Danish, Dutch, English, Estonian, Finnish, French, German, Greek, Italian, Norwegian, Polish, Portuguese, Slovene, Spanish, Swedish, and Turkish.
An unsupervised multilingual sentence boundary detection library for Go. The goal of this library is to break up any text into a list of sentences in multiple languages. The punkt system accomplishes this by training the tokenizer on text in the given language. Once the likelihoods of abbreviations, collocations, and sentence starters are determined, finding sentence boundaries becomes easier.
Many problems arise when tokenizing text into sentences, the primary one being abbreviations. The punkt system attempts to determine whether a word is an abbreviation, the end of a sentence, or both by training on text in the given language. It incorporates both token- and type-based analysis of the text through two different phases of annotation.
Original research article: http://citeseerx.ist.psu.edu/viewdoc/download;jsessionid=BAE5C34E5C3B9DC60DFC4D93B85D8BB1?doi=10.1.1.85.5017&rep=rep1&type=pdf
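Basic usage follows the types indexed below; this is a minimal sketch, assuming a JSON training file is available on disk (the file path is only an example) and that Sentence exposes its text via a Text field:

```go
package main

import (
	"fmt"
	"os"

	"gopkg.in/neurosnap/sentences.v1"
)

func main() {
	// Load JSON training data for the target language. The path is an
	// example; point it at whatever training file you have generated.
	data, err := os.ReadFile("data/english.json")
	if err != nil {
		panic(err)
	}
	storage, err := sentences.LoadTraining(data)
	if err != nil {
		panic(err)
	}

	// Build a tokenizer with sane defaults and split text into sentences.
	tokenizer := sentences.NewSentenceTokenizer(storage)
	for _, s := range tokenizer.Tokenize("Hello there. Mr. Smith arrived today.") {
		fmt.Println(s.Text)
	}
}
```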
Index ¶
- func IsCjkPunct(r rune) bool
- type AnnotateTokens
- type DefaultPunctStrings
- func NewPunctStrings() *DefaultPunctStrings
- func (p *DefaultPunctStrings) HasSentencePunct(text string) bool
- func (p *DefaultPunctStrings) NonPunct() string
- func (p *DefaultPunctStrings) Punctuation() string
- type DefaultSentenceTokenizer
- func NewSentenceTokenizer(s *Storage) *DefaultSentenceTokenizer
- func NewTokenizer(s *Storage, word WordTokenizer, lang PunctStrings) *DefaultSentenceTokenizer
- func (s *DefaultSentenceTokenizer) AnnotateTokens(tokens []*Token, annotate ...AnnotateTokens) []*Token
- func (s *DefaultSentenceTokenizer) AnnotatedTokens(text string) []*Token
- func (s *DefaultSentenceTokenizer) SentencePositions(text string) []int
- func (s *DefaultSentenceTokenizer) Tokenize(text string) []*Sentence
- type DefaultTokenGrouper
- type DefaultWordTokenizer
- func NewWordTokenizer(p PunctStrings) *DefaultWordTokenizer
- func (p *DefaultWordTokenizer) FirstLower(t *Token) bool
- func (p *DefaultWordTokenizer) FirstUpper(t *Token) bool
- func (p *DefaultWordTokenizer) HasPeriodFinal(t *Token) bool
- func (p *DefaultWordTokenizer) HasSentEndChars(t *Token) bool
- func (p *DefaultWordTokenizer) IsAlpha(t *Token) bool
- func (p *DefaultWordTokenizer) IsEllipsis(t *Token) bool
- func (p *DefaultWordTokenizer) IsInitial(t *Token) bool
- func (p *DefaultWordTokenizer) IsNonPunct(t *Token) bool
- func (p *DefaultWordTokenizer) IsNumber(t *Token) bool
- func (p *DefaultWordTokenizer) Tokenize(text string, onlyPeriodContext bool) []*Token
- func (p *DefaultWordTokenizer) Type(t *Token) string
- func (p *DefaultWordTokenizer) TypeNoPeriod(t *Token) string
- func (p *DefaultWordTokenizer) TypeNoSentPeriod(t *Token) string
- type Ortho
- type OrthoContext
- type PunctStrings
- type Sentence
- type SentenceTokenizer
- type SetString
- func (ss SetString) Add(str string)
- func (ss SetString) Array() []string
- func (ss SetString) Has(str string) bool
- func (ss SetString) Remove(str string)
- type Storage
- func LoadTraining(data []byte) (*Storage, error)
- func NewStorage() *Storage
- func (p *Storage) IsAbbr(tokens ...string) bool
- type Token
- type TokenBasedAnnotation
- type TokenExistential
- type TokenFirst
- type TokenGrouper
- type TokenParser
- type TokenType
- type TypeBasedAnnotation
- func NewTypeBasedAnnotation(s *Storage, p PunctStrings, e TokenExistential) *TypeBasedAnnotation
- func (a *TypeBasedAnnotation) Annotate(tokens []*Token) []*Token
- type WordTokenizer
Functions ¶
func IsCjkPunct ¶
Types ¶
type AnnotateTokens ¶
AnnotateTokens is an interface used for the sentence tokenizer to add properties to any given token during tokenization.
func NewAnnotations ¶
func NewAnnotations(s *Storage, p PunctStrings, word WordTokenizer) []AnnotateTokens
NewAnnotations returns the default set of AnnotateTokens annotators that the tokenizer uses.
type DefaultPunctStrings ¶
type DefaultPunctStrings struct{}
DefaultPunctStrings are used to detect punctuation in the sentence tokenizer.
func NewPunctStrings ¶
func NewPunctStrings() *DefaultPunctStrings
NewPunctStrings creates a default set of properties
func (*DefaultPunctStrings) HasSentencePunct ¶
func (p *DefaultPunctStrings) HasSentencePunct(text string) bool
HasSentencePunct reports whether the supplied text contains a known sentence punctuation character.
func (*DefaultPunctStrings) NonPunct ¶
func (p *DefaultPunctStrings) NonPunct() string
NonPunct returns a regex string that matches non-punctuation characters.
func (*DefaultPunctStrings) Punctuation ¶
func (p *DefaultPunctStrings) Punctuation() string
Punctuation returns the set of punctuation characters.
type DefaultSentenceTokenizer ¶
type DefaultSentenceTokenizer struct { *Storage WordTokenizer PunctStrings Annotations []AnnotateTokens }
DefaultSentenceTokenizer is a sentence tokenizer which uses an unsupervised algorithm to build a model for abbreviation words, collocations, and words that start sentences and then uses that model to find sentence boundaries.
func NewSentenceTokenizer ¶
func NewSentenceTokenizer(s *Storage) *DefaultSentenceTokenizer
NewSentenceTokenizer creates a sentence tokenizer with sane defaults.
func NewTokenizer ¶
func NewTokenizer(s *Storage, word WordTokenizer, lang PunctStrings) *DefaultSentenceTokenizer
NewTokenizer wraps DefaultSentenceTokenizer, doing the work needed to customize the tokenizer.
func (*DefaultSentenceTokenizer) AnnotateTokens ¶
func (s *DefaultSentenceTokenizer) AnnotateTokens(tokens []*Token, annotate ...AnnotateTokens) []*Token
AnnotateTokens, given a set of tokens augmented with markers for line-start and paragraph-start, returns those tokens fully annotated, including predicted sentence breaks.
func (*DefaultSentenceTokenizer) AnnotatedTokens ¶
func (s *DefaultSentenceTokenizer) AnnotatedTokens(text string) []*Token
AnnotatedTokens returns the fully annotated word tokens. This allows for ad hoc adjustments to the tokens.
func (*DefaultSentenceTokenizer) SentencePositions ¶
func (s *DefaultSentenceTokenizer) SentencePositions(text string) []int
SentencePositions returns an array of sentence boundary positions instead of an array of sentences.
func (*DefaultSentenceTokenizer) Tokenize ¶
func (s *DefaultSentenceTokenizer) Tokenize(text string) []*Sentence
Tokenize splits text input into sentence tokens.
type DefaultTokenGrouper ¶
type DefaultTokenGrouper struct{}
DefaultTokenGrouper is the default implementation of TokenGrouper
func (*DefaultTokenGrouper) Group ¶
func (p *DefaultTokenGrouper) Group(tokens []*Token) [][2]*Token
Group is the primary logic for implementing TokenGrouper
type DefaultWordTokenizer ¶
type DefaultWordTokenizer struct { PunctStrings }
DefaultWordTokenizer is the default implementation of the WordTokenizer
func NewWordTokenizer ¶
func NewWordTokenizer(p PunctStrings) *DefaultWordTokenizer
NewWordTokenizer creates a new DefaultWordTokenizer
func (*DefaultWordTokenizer) FirstLower ¶
func (p *DefaultWordTokenizer) FirstLower(t *Token) bool
FirstLower is true if the token's first character is lowercase
func (*DefaultWordTokenizer) FirstUpper ¶
func (p *DefaultWordTokenizer) FirstUpper(t *Token) bool
FirstUpper is true if the token's first character is uppercase.
func (*DefaultWordTokenizer) HasPeriodFinal ¶
func (p *DefaultWordTokenizer) HasPeriodFinal(t *Token) bool
HasPeriodFinal is true if the last character in the word is a period
func (*DefaultWordTokenizer) HasSentEndChars ¶
func (p *DefaultWordTokenizer) HasSentEndChars(t *Token) bool
HasSentEndChars finds any sentence-ending punctuation, excluding a final period.
func (*DefaultWordTokenizer) IsAlpha ¶
func (p *DefaultWordTokenizer) IsAlpha(t *Token) bool
IsAlpha is true if the token text is all alphabetic.
func (*DefaultWordTokenizer) IsEllipsis ¶
func (p *DefaultWordTokenizer) IsEllipsis(t *Token) bool
IsEllipsis is true if the token text is that of an ellipsis.
func (*DefaultWordTokenizer) IsInitial ¶
func (p *DefaultWordTokenizer) IsInitial(t *Token) bool
IsInitial is true if the token text is that of an initial.
func (*DefaultWordTokenizer) IsNonPunct ¶
func (p *DefaultWordTokenizer) IsNonPunct(t *Token) bool
IsNonPunct is true if the token is either a number or is alphabetic.
func (*DefaultWordTokenizer) IsNumber ¶
func (p *DefaultWordTokenizer) IsNumber(t *Token) bool
IsNumber is true if the token text is that of a number.
func (*DefaultWordTokenizer) Tokenize ¶
func (p *DefaultWordTokenizer) Tokenize(text string, onlyPeriodContext bool) []*Token
Tokenize breaks text into words while preserving each word's character position and whether it starts a new line or a new paragraph.
func (*DefaultWordTokenizer) Type ¶
func (p *DefaultWordTokenizer) Type(t *Token) string
Type returns a case-normalized representation of the token.
func (*DefaultWordTokenizer) TypeNoPeriod ¶
func (p *DefaultWordTokenizer) TypeNoPeriod(t *Token) string
TypeNoPeriod is the type with its final period removed if it has one.
func (*DefaultWordTokenizer) TypeNoSentPeriod ¶
func (p *DefaultWordTokenizer) TypeNoSentPeriod(t *Token) string
TypeNoSentPeriod is the type with its final period removed if it is marked as a sentence break.
type Ortho ¶
Ortho is a promise for structs to implement an orthographic heuristic method.
type OrthoContext ¶
type OrthoContext struct { *Storage PunctStrings TokenType TokenFirst }
OrthoContext determines whether a token is capitalized, a sentence starter, etc.
func (*OrthoContext) Heuristic ¶
func (o *OrthoContext) Heuristic(token *Token) int
Heuristic decides whether the given token is the first token in a sentence.
type PunctStrings ¶
type PunctStrings interface { NonPunct() string Punctuation() string HasSentencePunct(string) bool }
PunctStrings implements all the functions necessary for punctuation strings. They are used to detect punctuation in the sentence tokenizer.
type Sentence ¶
Sentence is a container for a sentence, providing its character positions as well as its text.
func (Sentence) String ¶
type SentenceTokenizer ¶
type SentenceTokenizer interface { AnnotateTokens([]*Token, ...AnnotateTokens) []*Token Tokenize(string) []*Sentence }
The SentenceTokenizer interface is used by the Tokenize function and can be extended to correct sentence boundaries that punkt misses.
type SetString ¶
SetString is a simple implementation of a set of strings; probably not the best way to do this, but oh well.
func (SetString) Add ¶
Add adds a string key to the set
func (SetString) Array ¶
Array returns an array of keys from the set
func (SetString) Has ¶
Has checks whether a key exists in the set
func (SetString) Remove ¶
Remove deletes a string key from the set
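The SetString API above can be sketched as a thin wrapper around a Go map; this is an illustrative reimplementation under that assumption, not the package's actual source:

```go
package main

import (
	"fmt"
	"sort"
)

// SetString models a set of strings as a map; the empty struct{} value
// costs no memory per entry.
type SetString map[string]struct{}

// Add adds a string key to the set.
func (ss SetString) Add(str string) { ss[str] = struct{}{} }

// Remove deletes a string key from the set.
func (ss SetString) Remove(str string) { delete(ss, str) }

// Has checks whether a key exists in the set.
func (ss SetString) Has(str string) bool {
	_, ok := ss[str]
	return ok
}

// Array returns the keys of the set as a slice, sorted for
// deterministic output (map iteration order is random in Go).
func (ss SetString) Array() []string {
	keys := make([]string, 0, len(ss))
	for k := range ss {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	return keys
}

func main() {
	ss := SetString{}
	ss.Add("dr")
	ss.Add("etc")
	ss.Remove("etc")
	fmt.Println(ss.Has("dr"), ss.Array()) // true [dr]
}
```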
type Storage ¶
type Storage struct {
	AbbrevTypes  SetString `json:"AbbrevTypes"`
	Collocations SetString `json:"Collocations"`
	SentStarters SetString `json:"SentStarters"`
	OrthoContext SetString `json:"OrthoContext"`
}
Storage stores data used to perform sentence boundary detection with punkt. This is where all the training data gets stored for future use.
func LoadTraining ¶
LoadTraining is the primary function to load JSON training data. By default, the sentence tokenizer loads English automatically, but other languages can be loaded into a binary file using the `make <lang>` command.
func NewStorage ¶
func NewStorage() *Storage
NewStorage creates the default storage container
func (*Storage) IsAbbr ¶
IsAbbr determines whether any of the tokens are an abbreviation
type Token ¶
type Token struct {
	Tok       string
	Position  int
	SentBreak bool
	ParaStart bool
	LineStart bool
	Abbr      bool
	// contains filtered or unexported fields
}
Token stores a token of text with annotations produced during sentence boundary detection.
func NewToken ¶
NewToken is the default constructor for the Token struct
func (*Token) String ¶
String is the string representation of Token
type TokenBasedAnnotation ¶
type TokenBasedAnnotation struct { *Storage PunctStrings TokenParser TokenGrouper Ortho }
TokenBasedAnnotation performs a token-based classification (section 4) over the given tokens, making use of the orthographic heuristic (4.1.1), collocation heuristic (4.1.2) and frequent sentence starter heuristic (4.1.3).
func (*TokenBasedAnnotation) Annotate ¶
func (a *TokenBasedAnnotation) Annotate(tokens []*Token) []*Token
Annotate groups tokens into adjacent pairs and then iterates over them to apply token-based annotation
type TokenExistential ¶
type TokenExistential interface {
	// True if the token text is all alphabetic.
	IsAlpha(*Token) bool
	// True if the token text is that of an ellipsis.
	IsEllipsis(*Token) bool
	// True if the token text is that of an initial.
	IsInitial(*Token) bool
	// True if the token text is that of a number.
	IsNumber(*Token) bool
	// True if the token is either a number or is alphabetic.
	IsNonPunct(*Token) bool
	// Does this token end with a period?
	HasPeriodFinal(*Token) bool
	// Does this token end with a punctuation and a quote?
	HasSentEndChars(*Token) bool
}
TokenExistential are helpers to determine what type of token we are dealing with.
type TokenFirst ¶
type TokenFirst interface {
	// True if the token's first character is lowercase.
	FirstLower(*Token) bool
	// True if the token's first character is uppercase.
	FirstUpper(*Token) bool
}
TokenFirst are helpers to determine the case of the token's first letter
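A first-letter case check like these presumably inspects the first rune rather than the first byte, which matters for multi-byte letters; a stand-alone illustration (not the package's code):

```go
package main

import (
	"fmt"
	"unicode"
	"unicode/utf8"
)

// firstUpper reports whether a word's first character is uppercase.
// DecodeRuneInString keeps this correct for multi-byte letters such as 'É'.
func firstUpper(s string) bool {
	r, _ := utf8.DecodeRuneInString(s)
	return unicode.IsUpper(r)
}

// firstLower is the lowercase counterpart.
func firstLower(s string) bool {
	r, _ := utf8.DecodeRuneInString(s)
	return unicode.IsLower(r)
}

func main() {
	fmt.Println(firstUpper("École"), firstLower("école")) // true true
}
```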
type TokenGrouper ¶
TokenGrouper groups two adjacent tokens together.
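Pairing adjacent tokens can be sketched as a sliding window; an illustrative version with a stripped-down Token (the real struct carries more fields), where the last token is paired with nil so heuristics can still inspect it:

```go
package main

import "fmt"

// Token is a stripped-down stand-in for the package's Token struct.
type Token struct{ Tok string }

// group pairs each token with its successor; the final token is
// paired with nil to mark the end of input.
func group(tokens []*Token) [][2]*Token {
	pairs := make([][2]*Token, 0, len(tokens))
	for i, t := range tokens {
		var next *Token
		if i+1 < len(tokens) {
			next = tokens[i+1]
		}
		pairs = append(pairs, [2]*Token{t, next})
	}
	return pairs
}

func main() {
	toks := []*Token{{"Mr."}, {"Smith"}, {"left."}}
	for _, p := range group(toks) {
		if p[1] != nil {
			fmt.Println(p[0].Tok, "->", p[1].Tok)
		} else {
			fmt.Println(p[0].Tok, "-> <end>")
		}
	}
}
```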
type TokenParser ¶
type TokenParser interface { TokenType TokenFirst TokenExistential }
TokenParser is the primary token interface that determines the context and type of a tokenized word.
type TokenType ¶
type TokenType interface {
	Type(*Token) string
	// The type with its final period removed if it has one.
	TypeNoPeriod(*Token) string
	// The type with its final period removed if it is marked as a sentence break.
	TypeNoSentPeriod(*Token) string
}
TokenType are helpers to get the type of a token
type TypeBasedAnnotation ¶
type TypeBasedAnnotation struct { *Storage PunctStrings TokenExistential }
TypeBasedAnnotation performs the first pass of annotation, which makes decisions based purely on the word type of each word:
- '?', '!', and '.' are marked as sentence breaks.
- sequences of two or more periods are marked as ellipsis.
- any word ending in '.' that's a known abbreviation is marked as an abbreviation.
- any other word ending in '.' is marked as a sentence break.
In the original punkt description, these annotations are returned as three sets:
- sentbreak_toks: The indices of all sentence breaks.
- abbrev_toks: The indices of all abbreviations.
- ellipsis_toks: The indices of all ellipsis marks.
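These per-word rules are easy to sketch in isolation; the helper below is a hypothetical, simplified classifier (the package applies the same decisions via TokenExistential methods on *Token values, not bare strings):

```go
package main

import (
	"fmt"
	"strings"
)

// classify applies the first-pass, type-based rules to a single word,
// given a set of known abbreviations (stored lowercase, no period).
func classify(word string, abbrevs map[string]bool) string {
	switch {
	case word == "?" || word == "!" || word == ".":
		// Bare sentence punctuation is a sentence break.
		return "sentbreak"
	case strings.Count(word, ".") >= 2 && strings.Trim(word, ".") == "":
		// Two or more periods in a row mark an ellipsis.
		return "ellipsis"
	case strings.HasSuffix(word, "."):
		// A word ending in '.' is an abbreviation if it is known,
		// otherwise a sentence break.
		if abbrevs[strings.ToLower(strings.TrimSuffix(word, "."))] {
			return "abbrev"
		}
		return "sentbreak"
	}
	return "plain"
}

func main() {
	abbrevs := map[string]bool{"mr": true, "dr": true}
	for _, w := range []string{"Mr.", "...", "left.", "!", "home"} {
		fmt.Printf("%-6s %s\n", w, classify(w, abbrevs))
	}
}
```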
func NewTypeBasedAnnotation ¶
func NewTypeBasedAnnotation(s *Storage, p PunctStrings, e TokenExistential) *TypeBasedAnnotation
NewTypeBasedAnnotation creates an instance of the TypeBasedAnnotation struct
func (*TypeBasedAnnotation) Annotate ¶
func (a *TypeBasedAnnotation) Annotate(tokens []*Token) []*Token
Annotate iterates over all tokens and applies the type annotation on them
type WordTokenizer ¶
type WordTokenizer interface { TokenParser Tokenize(string, bool) []*Token }
WordTokenizer is the primary interface for tokenizing words
Source Files ¶
annotate.go main.go ortho.go punctuation.go sentence_tokenizer.go storage.go token.go word_tokenizer.go
Directories ¶
Path | Synopsis
---|---
_cmd |
_cmd/sentences |
data |
english |
utils |
- Version: v1.0.7 (latest)
- Published: May 26, 2021