package uniseg

import "github.com/rivo/uniseg"

Package uniseg implements Unicode Text Segmentation and Unicode Line Breaking. Unicode Text Segmentation conforms to Unicode Standard Annex #29 (https://unicode.org/reports/tr29/) and Unicode Line Breaking conforms to Unicode Standard Annex #14 (https://unicode.org/reports/tr14/).

In short, using this package, you can split a string into grapheme clusters (what people would usually refer to as a "character"), into words, and into sentences. Or, in its simplest case, this package allows you to count the number of characters in a string, especially when it contains complex characters such as emojis, combining characters, or characters from Asian, Arabic, Hebrew, or other languages. Additionally, you can use it to implement line breaking (or "word wrapping"), that is, to determine where text can be broken over to the next line when the width of the line is not big enough to fit the entire text.

Grapheme Clusters

Consider the rainbow flag emoji: 🏳️‍🌈. On most modern systems, it appears as one character. But its string representation actually has 14 bytes, so counting bytes (or using len("🏳️‍🌈")) will not work as expected. Counting runes won't, either: The flag has 4 Unicode code points, thus 4 runes. The stdlib function utf8.RuneCountInString("🏳️‍🌈") and len([]rune("🏳️‍🌈")) will both return 4.
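The byte and rune counts quoted above can be checked with the standard library alone. In this minimal sketch the flag is written as escape sequences (white flag U+1F3F3, variation selector U+FE0F, zero-width joiner U+200D, rainbow U+1F308) so that the four code points are visible. The names rainbowFlag and counts are illustrative only and are not part of this package.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// rainbowFlag is the rainbow flag emoji spelled out as its four code points:
// white flag, variation selector-16, zero-width joiner, rainbow.
const rainbowFlag = "\U0001F3F3\uFE0F\u200D\U0001F308"

// counts returns the number of bytes and the number of runes in s.
// Neither number is the count of user-perceived characters.
func counts(s string) (nBytes, nRunes int) {
	return len(s), utf8.RuneCountInString(s)
}

func main() {
	b, r := counts(rainbowFlag)
	fmt.Println(b, r) // prints "14 4"
}
```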

The uniseg.GraphemeClusterCount(str) function will return 1 for the rainbow flag emoji. The Graphemes type and a variety of functions in this package will allow you to split strings into their grapheme clusters.

Word Boundaries

Word boundaries are used in a number of different contexts. The most familiar ones are selection (double-click mouse selection), cursor movement ("move to next word" control-arrow keys), and the dialog option "Whole Word Search" for search and replace. This package provides methods for determining word boundaries.

Sentence Boundaries

Sentence boundaries are often used for triple-click or some other method of selecting or iterating through blocks of text that are larger than single words. They are also used to determine whether words occur within the same sentence in database queries. This package provides methods for determining sentence boundaries.

Line Breaking

Line breaking, also known as word wrapping, is the process of breaking a section of text into lines such that it will fit in the available width of a page, window or other display area. This package provides methods to determine the positions in a string where a line must be broken, may be broken, or must not be broken.

Index ¶

Constants
func FirstGraphemeCluster(b []byte, state int) (cluster, rest []byte, reserved, newState int)
func FirstGraphemeClusterInString(str string, state int) (cluster, rest string, reserved, newState int)
func FirstLineSegment(b []byte, state int) (segment, rest []byte, mustBreak bool, newState int)
func FirstLineSegmentInString(str string, state int) (segment, rest string, mustBreak bool, newState int)
func FirstSentence(b []byte, state int) (sentence, rest []byte, newState int)
func FirstSentenceInString(str string, state int) (sentence, rest string, newState int)
func FirstWord(b []byte, state int) (word, rest []byte, newState int)
func FirstWordInString(str string, state int) (word, rest string, newState int)
func GraphemeClusterCount(s string) (n int)
func Step(b []byte, state int) (cluster, rest []byte, boundaries int, newState int)
func StepString(str string, state int) (cluster, rest string, boundaries int, newState int)
type Graphemes

func NewGraphemes(s string) *Graphemes
func (g *Graphemes) Bytes() []byte
func (g *Graphemes) IsSentenceBoundary() bool
func (g *Graphemes) IsWordBoundary() bool
func (g *Graphemes) LineBreak() int
func (g *Graphemes) Next() bool
func (g *Graphemes) Positions() (int, int)
func (g *Graphemes) Reset()
func (g *Graphemes) Runes() []rune
func (g *Graphemes) Str() string

Constants ¶

const (
	LineDontBreak = iota // You may not break the line here.
	LineCanBreak         // You may or may not break the line here.
	LineMustBreak        // You must break the line here.
)

These constants define whether a given text may be broken into the next line. If the break is optional (LineCanBreak), you may choose to break or not based on your own criteria, for example, if the text has reached the available width.

const (
	MaskLine     = 3
	MaskWord     = 4
	MaskSentence = 8
)

The bit masks used to extract boundary information returned by the Step() function.

Functions ¶

func FirstGraphemeCluster ¶

func FirstGraphemeCluster(b []byte, state int) (cluster, rest []byte, reserved, newState int)

FirstGraphemeCluster returns the first grapheme cluster found in the given byte slice according to the rules of Unicode Standard Annex #29, Grapheme Cluster Boundaries. This function can be called continuously to extract all grapheme clusters from a byte slice, as illustrated in the example below.

If you don't know the current state, for example when calling the function for the first time, you must pass -1. For consecutive calls, pass the state and rest slice returned by the previous call.

The "rest" slice is the sub-slice of the original byte slice "b" starting after the last byte of the identified grapheme cluster. If the length of the "rest" slice is 0, the entire byte slice "b" has been processed. The "cluster" byte slice is the sub-slice of the input slice containing the identified grapheme cluster.

Given an empty byte slice "b", the function returns nil values.

While slightly less convenient than using the Graphemes type, this function has much better performance and makes no allocations. It lends itself well to large byte slices.

The "reserved" return value is a placeholder for future functionality and may be ignored for the time being.

Example¶

Code:

package main

import (
	"fmt"

	"github.com/rivo/uniseg"
)

func main() {
	b := []byte("🇩🇪🏳️‍🌈")
	state := -1
	var c []byte
	for len(b) > 0 {
		c, b, _, state = uniseg.FirstGraphemeCluster(b, state)
		fmt.Println(string(c))
	}
}

Output:

🇩🇪
🏳️‍🌈

func FirstGraphemeClusterInString ¶

func FirstGraphemeClusterInString(str string, state int) (cluster, rest string, reserved, newState int)

FirstGraphemeClusterInString is like FirstGraphemeCluster() but its input and outputs are strings.

Example¶

Code:

package main

import (
	"fmt"

	"github.com/rivo/uniseg"
)

func main() {
	str := "🇩🇪🏳️‍🌈"
	state := -1
	var c string
	for len(str) > 0 {
		c, str, _, state = uniseg.FirstGraphemeClusterInString(str, state)
		fmt.Println(c)
	}
}

Output:

🇩🇪
🏳️‍🌈

func FirstLineSegment ¶

func FirstLineSegment(b []byte, state int) (segment, rest []byte, mustBreak bool, newState int)

FirstLineSegment returns the prefix of the given byte slice after which a decision to break the string over to the next line can or must be made, according to the rules of Unicode Standard Annex #14. This is used to implement line breaking.

Line breaking, also known as word wrapping, is the process of breaking a section of text into lines such that it will fit in the available width of a page, window or other display area.

The returned "segment" may not be broken into smaller parts, unless no other breaking opportunities present themselves, in which case you may break by grapheme clusters (using the FirstGraphemeCluster() function to determine the grapheme clusters).

The "mustBreak" flag indicates whether you MUST break the line after the given segment (true), for example after newline characters, or you MAY break the line after the given segment (false).

This function can be called repeatedly to extract all line segments from a byte slice, as illustrated in the example below.

If you don't know the current state, for example when calling the function for the first time, you must pass -1. For consecutive calls, pass the state and rest slice returned by the previous call.

The "rest" slice is the sub-slice of the original byte slice "b" starting after the last byte of the identified line segment. If the length of the "rest" slice is 0, the entire byte slice "b" has been processed. The "segment" byte slice is the sub-slice of the input slice containing the identified line segment.

Given an empty byte slice "b", the function returns nil values.

Note that in accordance with UAX #14 LB3, the final segment will end with "mustBreak" set to true. You can choose to ignore this by checking if the length of the "rest" slice is 0.

Note also that this algorithm may break within grapheme clusters. This is addressed in Section 8.2 Example 6 of UAX #14. To avoid this, you can use the Step() function instead.
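The notes above suggest a simple greedy wrapping loop: append segments to the current line while they fit, start a new line when they do not, always break after a segment flagged "mustBreak", and ignore that flag on the final segment. The following is a minimal sketch of that loop under stated assumptions. The segment type and the wrap function are hypothetical and not part of this package, the segments are hard-coded in the shape FirstLineSegmentInString() reports them instead of being produced by a call, and width is approximated by rune count, which is a simplification.

```go
package main

import (
	"fmt"
	"strings"
	"unicode/utf8"
)

// segment is a hypothetical stand-in for one result of
// FirstLineSegmentInString: the segment text and its mustBreak flag.
type segment struct {
	text      string
	mustBreak bool
}

// wrap greedily packs segments into lines of at most width runes.
// A segment wider than width is placed on a line of its own, unbroken.
func wrap(segs []segment, width int) []string {
	var lines []string
	var cur strings.Builder
	curWidth := 0
	flush := func() {
		// Trailing spaces and newlines are not part of the visible line.
		lines = append(lines, strings.TrimRight(cur.String(), " \n"))
		cur.Reset()
		curWidth = 0
	}
	for i, s := range segs {
		// Measure the segment without its trailing whitespace.
		w := utf8.RuneCountInString(strings.TrimRight(s.text, " \n"))
		if curWidth > 0 && curWidth+w > width {
			flush() // optional break taken: the segment does not fit
		}
		cur.WriteString(s.text)
		curWidth += utf8.RuneCountInString(s.text)
		// Honor mandatory breaks, except the one after the final segment.
		if s.mustBreak && i != len(segs)-1 {
			flush()
		}
	}
	if cur.Len() > 0 {
		flush()
	}
	return lines
}

func main() {
	segs := []segment{
		{"First ", false}, {"line.\n", true},
		{"Second ", false}, {"line.", true},
	}
	fmt.Printf("%q\n", wrap(segs, 20)) // ["First line." "Second line."]
}
```

With a narrower width such as 8, the same input yields four lines, because each optional break is taken.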

Example¶

Code:

package main

import (
	"fmt"

	"github.com/rivo/uniseg"
)

func main() {
	b := []byte("First line.\nSecond line.")
	state := -1
	var (
		c         []byte
		mustBreak bool
	)
	for len(b) > 0 {
		c, b, mustBreak, state = uniseg.FirstLineSegment(b, state)
		fmt.Printf("(%s)", string(c))
		if mustBreak {
			fmt.Print("!")
		}
	}
}

Output:

(First )(line.
)!(Second )(line.)!

func FirstLineSegmentInString ¶

func FirstLineSegmentInString(str string, state int) (segment, rest string, mustBreak bool, newState int)

FirstLineSegmentInString is like FirstLineSegment() but its input and outputs are strings.

Example¶

Code:

package main

import (
	"fmt"

	"github.com/rivo/uniseg"
)

func main() {
	str := "First line.\nSecond line."
	state := -1
	var (
		c         string
		mustBreak bool
	)
	for len(str) > 0 {
		c, str, mustBreak, state = uniseg.FirstLineSegmentInString(str, state)
		fmt.Printf("(%s)", c)
		if mustBreak {
			fmt.Println(" < must break")
		} else {
			fmt.Println(" < may break")
		}
	}
}

Output:

(First ) < may break
(line.
) < must break
(Second ) < may break
(line.) < must break

func FirstSentence ¶

func FirstSentence(b []byte, state int) (sentence, rest []byte, newState int)

FirstSentence returns the first sentence found in the given byte slice according to the rules of Unicode Standard Annex #29, Sentence Boundaries. This function can be called continuously to extract all sentences from a byte slice, as illustrated in the example below.

If you don't know the current state, for example when calling the function for the first time, you must pass -1. For consecutive calls, pass the state and rest slice returned by the previous call.

The "rest" slice is the sub-slice of the original byte slice "b" starting after the last byte of the identified sentence. If the length of the "rest" slice is 0, the entire byte slice "b" has been processed. The "sentence" byte slice is the sub-slice of the input slice containing the identified sentence.

Given an empty byte slice "b", the function returns nil values.

Example¶

Code:

package main

import (
	"fmt"

	"github.com/rivo/uniseg"
)

func main() {
	b := []byte("This is sentence 1.0. And this is sentence two.")
	state := -1
	var c []byte
	for len(b) > 0 {
		c, b, state = uniseg.FirstSentence(b, state)
		fmt.Printf("(%s)\n", string(c))
	}
}

Output:

(This is sentence 1.0. )
(And this is sentence two.)

func FirstSentenceInString ¶

func FirstSentenceInString(str string, state int) (sentence, rest string, newState int)

FirstSentenceInString is like FirstSentence() but its input and outputs are strings.

Example¶

Code:

package main

import (
	"fmt"

	"github.com/rivo/uniseg"
)

func main() {
	str := "This is sentence 1.0. And this is sentence two."
	state := -1
	var c string
	for len(str) > 0 {
		c, str, state = uniseg.FirstSentenceInString(str, state)
		fmt.Printf("(%s)\n", c)
	}
}

Output:

(This is sentence 1.0. )
(And this is sentence two.)

func FirstWord ¶

func FirstWord(b []byte, state int) (word, rest []byte, newState int)

FirstWord returns the first word found in the given byte slice according to the rules of Unicode Standard Annex #29, Word Boundaries. This function can be called continuously to extract all words from a byte slice, as illustrated in the example below.

If you don't know the current state, for example when calling the function for the first time, you must pass -1. For consecutive calls, pass the state and rest slice returned by the previous call.

The "rest" slice is the sub-slice of the original byte slice "b" starting after the last byte of the identified word. If the length of the "rest" slice is 0, the entire byte slice "b" has been processed. The "word" byte slice is the sub-slice of the input slice containing the identified word.

Given an empty byte slice "b", the function returns nil values.

Example¶

Code:

package main

import (
	"fmt"

	"github.com/rivo/uniseg"
)

func main() {
	b := []byte("Hello, world!")
	state := -1
	var c []byte
	for len(b) > 0 {
		c, b, state = uniseg.FirstWord(b, state)
		fmt.Printf("(%s)\n", string(c))
	}
}

Output:

(Hello)
(,)
( )
(world)
(!)

func FirstWordInString ¶

func FirstWordInString(str string, state int) (word, rest string, newState int)

FirstWordInString is like FirstWord() but its input and outputs are strings.

Example¶

Code:

package main

import (
	"fmt"

	"github.com/rivo/uniseg"
)

func main() {
	str := "Hello, world!"
	state := -1
	var c string
	for len(str) > 0 {
		c, str, state = uniseg.FirstWordInString(str, state)
		fmt.Printf("(%s)\n", c)
	}
}

Output:

(Hello)
(,)
( )
(world)
(!)

func GraphemeClusterCount ¶

func GraphemeClusterCount(s string) (n int)

GraphemeClusterCount returns the number of user-perceived characters (grapheme clusters) for the given string.

Example¶

Code:

package main

import (
	"fmt"

	"github.com/rivo/uniseg"
)

func main() {
	n := uniseg.GraphemeClusterCount("🇩🇪🏳️‍🌈")
	fmt.Println(n)
}

Output:

2

func Step ¶

func Step(b []byte, state int) (cluster, rest []byte, boundaries int, newState int)

Step returns the first grapheme cluster (user-perceived character) found in the given byte slice. It also returns information about the boundary between that grapheme cluster and the one following it. There are three types of boundary information: word boundaries, sentence boundaries, and line breaks. This function is therefore a combination of FirstGraphemeCluster(), FirstWord(), FirstSentence(), and FirstLineSegment().

The "boundaries" return value can be evaluated as follows:

boundaries&MaskWord != 0: The boundary is a word boundary.
boundaries&MaskWord == 0: The boundary is not a word boundary.
boundaries&MaskSentence != 0: The boundary is a sentence boundary.
boundaries&MaskSentence == 0: The boundary is not a sentence boundary.
boundaries&MaskLine == LineDontBreak: You must not break the line at the boundary.
boundaries&MaskLine == LineMustBreak: You must break the line at the boundary.
boundaries&MaskLine == LineCanBreak: You may or may not break the line at the boundary.
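The checks listed above can be gathered into one small helper. This is a minimal sketch: the constants are local copies of the values documented in the Constants section so that the snippet compiles on its own (real code would use uniseg.MaskLine and so on), the decode function is hypothetical, and the input value is built by hand for illustration instead of being returned by Step().

```go
package main

import "fmt"

// Local copies of the documented constants, for a self-contained sketch.
const (
	LineDontBreak = iota // You may not break the line here.
	LineCanBreak         // You may or may not break the line here.
	LineMustBreak        // You must break the line here.
)

const (
	MaskLine     = 3
	MaskWord     = 4
	MaskSentence = 8
)

// decode splits a "boundaries" value into its three components:
// word boundary flag, sentence boundary flag, and line break value.
func decode(boundaries int) (word, sentence bool, line int) {
	word = boundaries&MaskWord != 0
	sentence = boundaries&MaskSentence != 0
	line = boundaries & MaskLine
	return word, sentence, line
}

func main() {
	// Hand-built value: word boundary, sentence boundary, mandatory break.
	b := MaskWord | MaskSentence | LineMustBreak
	word, sentence, line := decode(b)
	fmt.Println(word, sentence, line == LineMustBreak) // true true true
}
```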

This function can be called continuously to extract all grapheme clusters from a byte slice, as illustrated in the examples below.

If you don't know which state to pass, for example when calling the function for the first time, you must pass -1. For consecutive calls, pass the state and rest slice returned by the previous call.

Given an empty byte slice "b", the function returns nil values.

While slightly less convenient than using the Graphemes type, this function has much better performance and makes no allocations. It lends itself well to large byte slices.

Note that in accordance with UAX #14 LB3, the final segment will end with a mandatory line break (boundaries&MaskLine == LineMustBreak). You can choose to ignore this by checking if the length of the "rest" slice is 0.

Example (Graphemes)¶

Code:

package main

import (
	"fmt"

	"github.com/rivo/uniseg"
)

func main() {
	b := []byte("🇩🇪🏳️‍🌈")
	state := -1
	var c []byte
	for len(b) > 0 {
		c, b, _, state = uniseg.Step(b, state)
		fmt.Println(string(c))
	}
}

Output:

🇩🇪
🏳️‍🌈

Example (LineBreaking)¶

Code:

package main

import (
	"fmt"

	"github.com/rivo/uniseg"
)

func main() {
	b := []byte("First line.\nSecond line.")
	state := -1
	var (
		c          []byte
		boundaries int
	)
	for len(b) > 0 {
		c, b, boundaries, state = uniseg.Step(b, state)
		fmt.Print(string(c))
		if boundaries&uniseg.MaskLine == uniseg.LineCanBreak {
			fmt.Print("|")
		} else if boundaries&uniseg.MaskLine == uniseg.LineMustBreak {
			fmt.Print("‖")
		}
	}
}

Output:

First |line.
‖Second |line.‖

Example (Sentence)¶

Code:

package main

import (
	"fmt"

	"github.com/rivo/uniseg"
)

func main() {
	b := []byte("This is sentence 1.0. And this is sentence two.")
	state := -1
	var (
		c          []byte
		boundaries int
	)
	for len(b) > 0 {
		c, b, boundaries, state = uniseg.Step(b, state)
		fmt.Print(string(c))
		if boundaries&uniseg.MaskSentence != 0 {
			fmt.Print("|")
		}
	}
}

Output:

This is sentence 1.0. |And this is sentence two.|

Example (Word)¶

Code:

package main

import (
	"fmt"

	"github.com/rivo/uniseg"
)

func main() {
	b := []byte("Hello, world!")
	state := -1
	var (
		c          []byte
		boundaries int
	)
	for len(b) > 0 {
		c, b, boundaries, state = uniseg.Step(b, state)
		fmt.Print(string(c))
		if boundaries&uniseg.MaskWord != 0 {
			fmt.Print("|")
		}
	}
}

Output:

Hello|,| |world|!|

func StepString ¶

func StepString(str string, state int) (cluster, rest string, boundaries int, newState int)

StepString is like Step() but its input and outputs are strings.

Example (Graphemes)¶

Code:

package main

import (
	"fmt"

	"github.com/rivo/uniseg"
)

func main() {
	str := "🇩🇪🏳️‍🌈"
	state := -1
	var c string
	for len(str) > 0 {
		c, str, _, state = uniseg.StepString(str, state)
		fmt.Println(c)
	}
}

Output:

🇩🇪
🏳️‍🌈

Example (LineBreaking)¶

Code:

package main

import (
	"fmt"

	"github.com/rivo/uniseg"
)

func main() {
	str := "First line.\nSecond line."
	state := -1
	var (
		c          string
		boundaries int
	)
	for len(str) > 0 {
		c, str, boundaries, state = uniseg.StepString(str, state)
		fmt.Print(c)
		if boundaries&uniseg.MaskLine == uniseg.LineCanBreak {
			fmt.Print("|")
		} else if boundaries&uniseg.MaskLine == uniseg.LineMustBreak {
			fmt.Print("‖")
		}
	}
}

Output:

First |line.
‖Second |line.‖

Example (Sentence)¶

Code:

package main

import (
	"fmt"

	"github.com/rivo/uniseg"
)

func main() {
	str := "This is sentence 1.0. And this is sentence two."
	state := -1
	var (
		c          string
		boundaries int
	)
	for len(str) > 0 {
		c, str, boundaries, state = uniseg.StepString(str, state)
		fmt.Print(c)
		if boundaries&uniseg.MaskSentence != 0 {
			fmt.Print("|")
		}
	}
}

Output:

This is sentence 1.0. |And this is sentence two.|

Example (Word)¶

Code:

package main

import (
	"fmt"

	"github.com/rivo/uniseg"
)

func main() {
	str := "Hello, world!"
	state := -1
	var (
		c          string
		boundaries int
	)
	for len(str) > 0 {
		c, str, boundaries, state = uniseg.StepString(str, state)
		fmt.Print(c)
		if boundaries&uniseg.MaskWord != 0 {
			fmt.Print("|")
		}
	}
}

Output:

Hello|,| |world|!|

Types ¶

type Graphemes ¶

type Graphemes struct {
	// contains filtered or unexported fields
}

Graphemes implements an iterator over Unicode grapheme clusters, or user-perceived characters. While iterating, it also provides information about word boundaries, sentence boundaries, and line breaks.

After constructing the iterator via NewGraphemes(str) for a given string "str", call Next() in a loop, once per grapheme cluster, until it returns false. Inside the loop, information about the grapheme cluster as well as boundary information is available via the various methods (see examples below).

Using this type to iterate over a string is convenient but it is much slower than using this package's Step() or StepString() functions or any of the other specialized functions starting with "First".

Example (Graphemes)¶

Code:

package main

import (
	"fmt"

	"github.com/rivo/uniseg"
)

func main() {
	g := uniseg.NewGraphemes("🇩🇪🏳️‍🌈")
	for g.Next() {
		fmt.Println(g.Str())
	}
}

Output:

🇩🇪
🏳️‍🌈

Example (LineBreaking)¶

Code:

package main

import (
	"fmt"

	"github.com/rivo/uniseg"
)

func main() {
	g := uniseg.NewGraphemes("First line.\nSecond line.")
	for g.Next() {
		fmt.Print(g.Str())
		if g.LineBreak() == uniseg.LineCanBreak {
			fmt.Print("|")
		} else if g.LineBreak() == uniseg.LineMustBreak {
			fmt.Print("‖")
		}
	}
}

Output:

First |line.
‖Second |line.‖

Example (Sentence)¶

Code:

package main

import (
	"fmt"

	"github.com/rivo/uniseg"
)

func main() {
	g := uniseg.NewGraphemes("This is sentence 1.0. And this is sentence two.")
	for g.Next() {
		fmt.Print(g.Str())
		if g.IsSentenceBoundary() {
			fmt.Print("|")
		}
	}
}

Output:

This is sentence 1.0. |And this is sentence two.|

Example (Word)¶

Code:

package main

import (
	"fmt"

	"github.com/rivo/uniseg"
)

func main() {
	g := uniseg.NewGraphemes("Hello, world!")
	for g.Next() {
		fmt.Print(g.Str())
		if g.IsWordBoundary() {
			fmt.Print("|")
		}
	}
}

Output:

Hello|,| |world|!|

func NewGraphemes ¶

func NewGraphemes(s string) *Graphemes

NewGraphemes returns a new grapheme cluster iterator.

func (*Graphemes) Bytes ¶

func (g *Graphemes) Bytes() []byte

Bytes returns a byte slice which corresponds to the current grapheme cluster. If the iterator is already past the end or Next() has not yet been called, nil is returned.

func (*Graphemes) IsSentenceBoundary ¶

func (g *Graphemes) IsSentenceBoundary() bool

IsSentenceBoundary returns true if a sentence ends after the current grapheme cluster.

func (*Graphemes) IsWordBoundary ¶

func (g *Graphemes) IsWordBoundary() bool

IsWordBoundary returns true if a word ends after the current grapheme cluster.

func (*Graphemes) LineBreak ¶

func (g *Graphemes) LineBreak() int

LineBreak returns whether the line can be broken after the current grapheme cluster. A value of LineDontBreak means the line may not be broken, a value of LineMustBreak means the line must be broken, and a value of LineCanBreak means the line may or may not be broken.

func (*Graphemes) Next ¶

func (g *Graphemes) Next() bool

Next advances the iterator by one grapheme cluster and returns false if no clusters are left. This function must be called before the first cluster is accessed.

func (*Graphemes) Positions ¶

func (g *Graphemes) Positions() (int, int)

Positions returns the interval of the current grapheme cluster as byte positions into the original string. The first returned value "from" indexes the first byte and the second returned value "to" indexes the first byte that is not included anymore, i.e. str[from:to] is the current grapheme cluster of the original string "str". If Next() has not yet been called, both values are 0. If the iterator is already past the end, both values are 1.
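Because the interval is half-open, the "to" of one cluster is the "from" of the next, so the intervals of consecutive clusters tile the original string without gaps. The sketch below illustrates only that arithmetic and does not call this package: the spans helper is hypothetical, and the two clusters (German flag and rainbow flag, written as escape sequences) are supplied by hand in the form the iterator would yield them.

```go
package main

import "fmt"

// spans returns the half-open byte interval [from, to) of each cluster,
// assuming the clusters appear back to back in the original string.
func spans(clusters []string) [][2]int {
	out := make([][2]int, 0, len(clusters))
	from := 0
	for _, c := range clusters {
		to := from + len(c)
		out = append(out, [2]int{from, to})
		from = to // the next interval starts where this one ends
	}
	return out
}

func main() {
	clusters := []string{
		"\U0001F1E9\U0001F1EA",             // German flag: two regional indicators
		"\U0001F3F3\uFE0F\u200D\U0001F308", // rainbow flag
	}
	str := clusters[0] + clusters[1]
	for i, s := range spans(clusters) {
		// str[from:to] reproduces the cluster exactly.
		fmt.Println(s[0], s[1], str[s[0]:s[1]] == clusters[i])
	}
	// Prints:
	// 0 8 true
	// 8 22 true
}
```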

func (*Graphemes) Reset ¶

func (g *Graphemes) Reset()

Reset puts the iterator into its initial state such that the next call to Next() sets it to the first grapheme cluster again.

func (*Graphemes) Runes ¶

func (g *Graphemes) Runes() []rune

Runes returns a slice of runes (code points) which corresponds to the current grapheme cluster. If the iterator is already past the end or Next() has not yet been called, nil is returned.

func (*Graphemes) Str ¶

func (g *Graphemes) Str() string

Str returns a substring of the original string which corresponds to the current grapheme cluster. If the iterator is already past the end or Next() has not yet been called, an empty string is returned.

Source Files ¶

doc.go eastasianwidth.go grapheme.go graphemeproperties.go graphemerules.go line.go lineproperties.go linerules.go properties.go sentence.go sentenceproperties.go sentencerules.go step.go word.go wordproperties.go wordrules.go

Version: v0.3.1
Published: Jul 28, 2022
Platform: windows/amd64
Imports: 1 packages