package sentences
import "gopkg.in/neurosnap/sentences.v1"
Package sentences is a Go package that converts a blob of text into a list of sentences.
This package attempts to support a multitude of languages: Czech, Danish, Dutch, English, Estonian, Finnish, French, German, Greek, Italian, Norwegian, Polish, Portuguese, Slovene, Spanish, Swedish, and Turkish.
An unsupervised multilingual sentence boundary detection library for Go. The goal of this library is to break up any text into a list of sentences in multiple languages. The punkt system accomplishes this by training the tokenizer on text in the given language. Once the likelihoods of abbreviations, collocations, and sentence starters are determined, finding sentence boundaries becomes easier.
Many problems arise when tokenizing text into sentences, the primary one being abbreviations. The punkt system attempts to determine whether a word is an abbreviation, the end of a sentence, or both by training on text in the given language. It incorporates both token- and type-based analysis of the text through two different phases of annotation.
Original research article: http://citeseerx.ist.psu.edu/viewdoc/download;jsessionid=BAE5C34E5C3B9DC60DFC4D93B85D8BB1?doi=10.1.1.85.5017&rep=rep1&type=pdf
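Basic usage follows the types indexed below; this is a minimal sketch, assuming a JSON training file is available on disk (the file path is only an example) and that Sentence exposes its text via a Text field:

```go
package main

import (
	"fmt"
	"os"

	"gopkg.in/neurosnap/sentences.v1"
)

func main() {
	// Load JSON training data for the target language. The path is an
	// example; point it at whatever training file you have generated.
	data, err := os.ReadFile("data/english.json")
	if err != nil {
		panic(err)
	}
	storage, err := sentences.LoadTraining(data)
	if err != nil {
		panic(err)
	}

	// Build a tokenizer with sane defaults and split text into sentences.
	tokenizer := sentences.NewSentenceTokenizer(storage)
	for _, s := range tokenizer.Tokenize("Hello there. Mr. Smith arrived today.") {
		fmt.Println(s.Text)
	}
}
```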
Index ¶
- func IsCjkPunct(r rune) bool
- type AnnotateTokens
- type DefaultPunctStrings
- func NewPunctStrings() *DefaultPunctStrings
- func (p *DefaultPunctStrings) HasSentencePunct(text string) bool
- func (p *DefaultPunctStrings) NonPunct() string
- func (p *DefaultPunctStrings) Punctuation() string
- type DefaultSentenceTokenizer
- func NewSentenceTokenizer(s *Storage) *DefaultSentenceTokenizer
- func NewTokenizer(s *Storage, word WordTokenizer, lang PunctStrings) *DefaultSentenceTokenizer
- func (s *DefaultSentenceTokenizer) AnnotateTokens(tokens []*Token, annotate ...AnnotateTokens) []*Token
- func (s *DefaultSentenceTokenizer) AnnotatedTokens(text string) []*Token
- func (s *DefaultSentenceTokenizer) SentencePositions(text string) []int
- func (s *DefaultSentenceTokenizer) Tokenize(text string) []*Sentence
- type DefaultTokenGrouper
- type DefaultWordTokenizer
- func NewWordTokenizer(p PunctStrings) *DefaultWordTokenizer
- func (p *DefaultWordTokenizer) FirstLower(t *Token) bool
- func (p *DefaultWordTokenizer) FirstUpper(t *Token) bool
- func (p *DefaultWordTokenizer) HasPeriodFinal(t *Token) bool
- func (p *DefaultWordTokenizer) HasSentEndChars(t *Token) bool
- func (p *DefaultWordTokenizer) IsAlpha(t *Token) bool
- func (p *DefaultWordTokenizer) IsEllipsis(t *Token) bool
- func (p *DefaultWordTokenizer) IsInitial(t *Token) bool
- func (p *DefaultWordTokenizer) IsNonPunct(t *Token) bool
- func (p *DefaultWordTokenizer) IsNumber(t *Token) bool
- func (p *DefaultWordTokenizer) Tokenize(text string, onlyPeriodContext bool) []*Token
- func (p *DefaultWordTokenizer) Type(t *Token) string
- func (p *DefaultWordTokenizer) TypeNoPeriod(t *Token) string
- func (p *DefaultWordTokenizer) TypeNoSentPeriod(t *Token) string
- type Ortho
- type OrthoContext
- type PunctStrings
- type Sentence
- type SentenceTokenizer
- type SetString
- func (ss SetString) Add(str string)
- func (ss SetString) Array() []string
- func (ss SetString) Has(str string) bool
- func (ss SetString) Remove(str string)
- type Storage
- func LoadTraining(data []byte) (*Storage, error)
- func NewStorage() *Storage
- func (p *Storage) IsAbbr(tokens ...string) bool
- type Token
- type TokenBasedAnnotation
- type TokenExistential
- type TokenFirst
- type TokenGrouper
- type TokenParser
- type TokenType
- type TypeBasedAnnotation
- func NewTypeBasedAnnotation(s *Storage, p PunctStrings, e TokenExistential) *TypeBasedAnnotation
- func (a *TypeBasedAnnotation) Annotate(tokens []*Token) []*Token
- type WordTokenizer
Functions ¶
func IsCjkPunct ¶
Types ¶
type AnnotateTokens ¶
AnnotateTokens is an interface used for the sentence tokenizer to add properties to any given token during tokenization.
func NewAnnotations ¶
func NewAnnotations(s *Storage, p PunctStrings, word WordTokenizer) []AnnotateTokens
NewAnnotations returns the default set of AnnotateTokens annotators that the tokenizer uses.
type DefaultPunctStrings ¶
type DefaultPunctStrings struct{}
DefaultPunctStrings are used to detect punctuation in the sentence tokenizer.
func NewPunctStrings ¶
func NewPunctStrings() *DefaultPunctStrings
NewPunctStrings creates a default set of properties
func (*DefaultPunctStrings) HasSentencePunct ¶
func (p *DefaultPunctStrings) HasSentencePunct(text string) bool
HasSentencePunct reports whether the supplied text contains a known sentence punctuation character.
func (*DefaultPunctStrings) NonPunct ¶
func (p *DefaultPunctStrings) NonPunct() string
NonPunct returns a regex string that matches non-punctuation characters.
func (*DefaultPunctStrings) Punctuation ¶
func (p *DefaultPunctStrings) Punctuation() string
Punctuation returns the set of punctuation characters.
type DefaultSentenceTokenizer ¶
type DefaultSentenceTokenizer struct { *Storage WordTokenizer PunctStrings Annotations []AnnotateTokens }
DefaultSentenceTokenizer is a sentence tokenizer which uses an unsupervised algorithm to build a model for abbreviation words, collocations, and words that start sentences and then uses that model to find sentence boundaries.
func NewSentenceTokenizer ¶
func NewSentenceTokenizer(s *Storage) *DefaultSentenceTokenizer
NewSentenceTokenizer creates a sentence tokenizer with sane defaults.
func NewTokenizer ¶
func NewTokenizer(s *Storage, word WordTokenizer, lang PunctStrings) *DefaultSentenceTokenizer
NewTokenizer wraps DefaultSentenceTokenizer, doing the work needed to customize the tokenizer.
func (*DefaultSentenceTokenizer) AnnotateTokens ¶
func (s *DefaultSentenceTokenizer) AnnotateTokens(tokens []*Token, annotate ...AnnotateTokens) []*Token
AnnotateTokens, given a set of tokens augmented with markers for line-start and paragraph-start, returns those tokens fully annotated, including predicted sentence breaks.
func (*DefaultSentenceTokenizer) AnnotatedTokens ¶
func (s *DefaultSentenceTokenizer) AnnotatedTokens(text string) []*Token
AnnotatedTokens returns the fully annotated word tokens. This allows for ad hoc adjustments to the tokens.
func (*DefaultSentenceTokenizer) SentencePositions ¶
func (s *DefaultSentenceTokenizer) SentencePositions(text string) []int
SentencePositions returns an array of sentence boundary positions instead of an array of sentences.
func (*DefaultSentenceTokenizer) Tokenize ¶
func (s *DefaultSentenceTokenizer) Tokenize(text string) []*Sentence
Tokenize splits text input into sentence tokens.
type DefaultTokenGrouper ¶
type DefaultTokenGrouper struct{}
DefaultTokenGrouper is the default implementation of TokenGrouper
func (*DefaultTokenGrouper) Group ¶
func (p *DefaultTokenGrouper) Group(tokens []*Token) [][2]*Token
Group is the primary logic for implementing TokenGrouper
type DefaultWordTokenizer ¶
type DefaultWordTokenizer struct { PunctStrings }
DefaultWordTokenizer is the default implementation of the WordTokenizer
func NewWordTokenizer ¶
func NewWordTokenizer(p PunctStrings) *DefaultWordTokenizer
NewWordTokenizer creates a new DefaultWordTokenizer
func (*DefaultWordTokenizer) FirstLower ¶
func (p *DefaultWordTokenizer) FirstLower(t *Token) bool
FirstLower is true if the token's first character is lowercase
func (*DefaultWordTokenizer) FirstUpper ¶
func (p *DefaultWordTokenizer) FirstUpper(t *Token) bool
FirstUpper is true if the token's first character is uppercase.
func (*DefaultWordTokenizer) HasPeriodFinal ¶
func (p *DefaultWordTokenizer) HasPeriodFinal(t *Token) bool
HasPeriodFinal is true if the last character in the word is a period
func (*DefaultWordTokenizer) HasSentEndChars ¶
func (p *DefaultWordTokenizer) HasSentEndChars(t *Token) bool
HasSentEndChars finds any sentence-ending punctuation, excluding a final period.
func (*DefaultWordTokenizer) IsAlpha ¶
func (p *DefaultWordTokenizer) IsAlpha(t *Token) bool
IsAlpha is true if the token text is all alphabetic.
func (*DefaultWordTokenizer) IsEllipsis ¶
func (p *DefaultWordTokenizer) IsEllipsis(t *Token) bool
IsEllipsis is true if the token text is that of an ellipsis.
func (*DefaultWordTokenizer) IsInitial ¶
func (p *DefaultWordTokenizer) IsInitial(t *Token) bool
IsInitial is true if the token text is that of an initial.
func (*DefaultWordTokenizer) IsNonPunct ¶
func (p *DefaultWordTokenizer) IsNonPunct(t *Token) bool
IsNonPunct is true if the token is either a number or is alphabetic.
func (*DefaultWordTokenizer) IsNumber ¶
func (p *DefaultWordTokenizer) IsNumber(t *Token) bool
IsNumber is true if the token text is that of a number.
func (*DefaultWordTokenizer) Tokenize ¶
func (p *DefaultWordTokenizer) Tokenize(text string, onlyPeriodContext bool) []*Token
Tokenize breaks text into words while preserving each word's character position and whether it starts a new line or a new paragraph.
func (*DefaultWordTokenizer) Type ¶
func (p *DefaultWordTokenizer) Type(t *Token) string
Type returns a case-normalized representation of the token.
func (*DefaultWordTokenizer) TypeNoPeriod ¶
func (p *DefaultWordTokenizer) TypeNoPeriod(t *Token) string
TypeNoPeriod is the type with its final period removed if it has one.
func (*DefaultWordTokenizer) TypeNoSentPeriod ¶
func (p *DefaultWordTokenizer) TypeNoSentPeriod(t *Token) string
TypeNoSentPeriod is the type with its final period removed if it is marked as a sentence break.
type Ortho ¶
Ortho is a promise for structs to implement an orthographic heuristic method.
type OrthoContext ¶
type OrthoContext struct { *Storage PunctStrings TokenType TokenFirst }
OrthoContext determines whether a token is capitalized, a sentence starter, etc.
func (*OrthoContext) Heuristic ¶
func (o *OrthoContext) Heuristic(token *Token) int
Heuristic decides whether the given token is the first token in a sentence.
type PunctStrings ¶
type PunctStrings interface { NonPunct() string Punctuation() string HasSentencePunct(string) bool }
PunctStrings implements all the functions necessary for punctuation strings. They are used to detect punctuation in the sentence tokenizer.
type Sentence ¶
Sentence is a container for a sentence, providing its character positions as well as its text.
func (Sentence) String ¶
type SentenceTokenizer ¶
type SentenceTokenizer interface { AnnotateTokens([]*Token, ...AnnotateTokens) []*Token Tokenize(string) []*Sentence }
The SentenceTokenizer interface is used by the Tokenize function and can be extended to correct sentence boundaries that punkt misses.
type SetString ¶
SetString is a simple implementation of a set of strings; probably not the best way to do this, but oh well.
func (SetString) Add ¶
Add adds a string key to the set
func (SetString) Array ¶
Array returns an array of keys from the set
func (SetString) Has ¶
Has checks whether a key exists in the set
func (SetString) Remove ¶
Remove deletes a string key from the set
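The SetString API above can be sketched as a thin wrapper around a Go map; this is an illustrative reimplementation under that assumption, not the package's actual source:

```go
package main

import (
	"fmt"
	"sort"
)

// SetString models a set of strings as a map; the empty struct{} value
// costs no memory per entry.
type SetString map[string]struct{}

// Add adds a string key to the set.
func (ss SetString) Add(str string) { ss[str] = struct{}{} }

// Remove deletes a string key from the set.
func (ss SetString) Remove(str string) { delete(ss, str) }

// Has checks whether a key exists in the set.
func (ss SetString) Has(str string) bool {
	_, ok := ss[str]
	return ok
}

// Array returns the keys of the set as a slice, sorted for
// deterministic output (map iteration order is random in Go).
func (ss SetString) Array() []string {
	keys := make([]string, 0, len(ss))
	for k := range ss {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	return keys
}

func main() {
	ss := SetString{}
	ss.Add("dr")
	ss.Add("etc")
	ss.Remove("etc")
	fmt.Println(ss.Has("dr"), ss.Array()) // true [dr]
}
```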
type Storage ¶
type Storage struct {
	AbbrevTypes  SetString `json:"AbbrevTypes"`
	Collocations SetString `json:"Collocations"`
	SentStarters SetString `json:"SentStarters"`
	OrthoContext SetString `json:"OrthoContext"`
}
Storage stores data used to perform sentence boundary detection with punkt. This is where all the training data gets stored for future use.
func LoadTraining ¶
LoadTraining is the primary function to load JSON training data. By default, the sentence tokenizer loads English automatically, but other languages can be loaded into a binary file using the `make <lang>` command.
func NewStorage ¶
func NewStorage() *Storage
NewStorage creates the default storage container
func (*Storage) IsAbbr ¶
IsAbbr determines whether any of the tokens are an abbreviation
type Token ¶
type Token struct {
	Tok       string
	Position  int
	SentBreak bool
	ParaStart bool
	LineStart bool
	Abbr      bool
	// contains filtered or unexported fields
}
Token stores a token of text with annotations produced during sentence boundary detection.
func NewToken ¶
NewToken is the default constructor for the Token struct
func (*Token) String ¶
String is the string representation of Token
type TokenBasedAnnotation ¶
type TokenBasedAnnotation struct { *Storage PunctStrings TokenParser TokenGrouper Ortho }
TokenBasedAnnotation performs a token-based classification (section 4) over the given tokens, making use of the orthographic heuristic (4.1.1), collocation heuristic (4.1.2) and frequent sentence starter heuristic (4.1.3).
func (*TokenBasedAnnotation) Annotate ¶
func (a *TokenBasedAnnotation) Annotate(tokens []*Token) []*Token
Annotate groups tokens into adjacent pairs and then iterates over them to apply token-based annotation
type TokenExistential ¶
type TokenExistential interface {
	// True if the token text is all alphabetic.
	IsAlpha(*Token) bool
	// True if the token text is that of an ellipsis.
	IsEllipsis(*Token) bool
	// True if the token text is that of an initial.
	IsInitial(*Token) bool
	// True if the token text is that of a number.
	IsNumber(*Token) bool
	// True if the token is either a number or is alphabetic.
	IsNonPunct(*Token) bool
	// Does this token end with a period?
	HasPeriodFinal(*Token) bool
	// Does this token end with a punctuation and a quote?
	HasSentEndChars(*Token) bool
}
TokenExistential are helpers to determine what type of token we are dealing with.
type TokenFirst ¶
type TokenFirst interface {
	// True if the token's first character is lowercase.
	FirstLower(*Token) bool
	// True if the token's first character is uppercase.
	FirstUpper(*Token) bool
}
TokenFirst are helpers to determine the case of the token's first letter
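A first-letter case check like these presumably inspects the first rune rather than the first byte, which matters for multi-byte letters; a stand-alone illustration (not the package's code):

```go
package main

import (
	"fmt"
	"unicode"
	"unicode/utf8"
)

// firstUpper reports whether a word's first character is uppercase.
// DecodeRuneInString keeps this correct for multi-byte letters such as 'É'.
func firstUpper(s string) bool {
	r, _ := utf8.DecodeRuneInString(s)
	return unicode.IsUpper(r)
}

// firstLower is the lowercase counterpart.
func firstLower(s string) bool {
	r, _ := utf8.DecodeRuneInString(s)
	return unicode.IsLower(r)
}

func main() {
	fmt.Println(firstUpper("École"), firstLower("école")) // true true
}
```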
type TokenGrouper ¶
TokenGrouper groups two adjacent tokens together.
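Pairing adjacent tokens can be sketched as a sliding window; an illustrative version with a stripped-down Token (the real struct carries more fields), where the last token is paired with nil so heuristics can still inspect it:

```go
package main

import "fmt"

// Token is a stripped-down stand-in for the package's Token struct.
type Token struct{ Tok string }

// group pairs each token with its successor; the final token is
// paired with nil to mark the end of input.
func group(tokens []*Token) [][2]*Token {
	pairs := make([][2]*Token, 0, len(tokens))
	for i, t := range tokens {
		var next *Token
		if i+1 < len(tokens) {
			next = tokens[i+1]
		}
		pairs = append(pairs, [2]*Token{t, next})
	}
	return pairs
}

func main() {
	toks := []*Token{{"Mr."}, {"Smith"}, {"left."}}
	for _, p := range group(toks) {
		if p[1] != nil {
			fmt.Println(p[0].Tok, "->", p[1].Tok)
		} else {
			fmt.Println(p[0].Tok, "-> <end>")
		}
	}
}
```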
type TokenParser ¶
type TokenParser interface { TokenType TokenFirst TokenExistential }
TokenParser is the primary token interface that determines the context and type of a tokenized word.
type TokenType ¶
type TokenType interface {
	Type(*Token) string
	// The type with its final period removed if it has one.
	TypeNoPeriod(*Token) string
	// The type with its final period removed if it is marked as a sentence break.
	TypeNoSentPeriod(*Token) string
}
TokenType are helpers to get the type of a token
type TypeBasedAnnotation ¶
type TypeBasedAnnotation struct { *Storage PunctStrings TokenExistential }
TypeBasedAnnotation performs the first pass of annotation, which makes decisions based purely on the word type of each word:
- '?', '!', and '.' are marked as sentence breaks.
- sequences of two or more periods are marked as ellipsis.
- any word ending in '.' that's a known abbreviation is marked as an abbreviation.
- any other word ending in '.' is marked as a sentence break.
In the original punkt description, these annotations are returned as three sets:
- sentbreak_toks: The indices of all sentence breaks.
- abbrev_toks: The indices of all abbreviations.
- ellipsis_toks: The indices of all ellipsis marks.
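These per-word rules are easy to sketch in isolation; the helper below is a hypothetical, simplified classifier (the package applies the same decisions via TokenExistential methods on *Token values, not bare strings):

```go
package main

import (
	"fmt"
	"strings"
)

// classify applies the first-pass, type-based rules to a single word,
// given a set of known abbreviations (stored lowercase, no period).
func classify(word string, abbrevs map[string]bool) string {
	switch {
	case word == "?" || word == "!" || word == ".":
		// Bare sentence punctuation is a sentence break.
		return "sentbreak"
	case strings.Count(word, ".") >= 2 && strings.Trim(word, ".") == "":
		// Two or more periods in a row mark an ellipsis.
		return "ellipsis"
	case strings.HasSuffix(word, "."):
		// A word ending in '.' is an abbreviation if it is known,
		// otherwise a sentence break.
		if abbrevs[strings.ToLower(strings.TrimSuffix(word, "."))] {
			return "abbrev"
		}
		return "sentbreak"
	}
	return "plain"
}

func main() {
	abbrevs := map[string]bool{"mr": true, "dr": true}
	for _, w := range []string{"Mr.", "...", "left.", "!", "home"} {
		fmt.Printf("%-6s %s\n", w, classify(w, abbrevs))
	}
}
```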
func NewTypeBasedAnnotation ¶
func NewTypeBasedAnnotation(s *Storage, p PunctStrings, e TokenExistential) *TypeBasedAnnotation
NewTypeBasedAnnotation creates an instance of the TypeBasedAnnotation struct
func (*TypeBasedAnnotation) Annotate ¶
func (a *TypeBasedAnnotation) Annotate(tokens []*Token) []*Token
Annotate iterates over all tokens and applies the type annotation on them
type WordTokenizer ¶
type WordTokenizer interface { TokenParser Tokenize(string, bool) []*Token }
WordTokenizer is the primary interface for tokenizing words
Source Files ¶
annotate.go main.go ortho.go punctuation.go sentence_tokenizer.go storage.go token.go word_tokenizer.go
Directories ¶
Path | Synopsis
---|---
_cmd |
_cmd/sentences |
data |
english |
utils |
- Version: v1.0.7 (latest)
- Published: May 26, 2021