Chunk Types

Chunk expressions let you work with all of these chunk types:

Type          Definition
characters    individual characters within text
words         words separated by any amount of white space (spaces, tabs, returns) within text
lines         paragraphs separated by any of several standard line endings (CR, LF, CRLF, etc.)
text items    portions of text separated by commas
list items    the individual items in a list
bytes         the bytes within binary data
occurrences   the text matches of a defined pattern
matches       the text matches and text range of a defined pattern and its capture groups

In addition, you can specify custom delimiters for identifying text items, lines, and words. These three text chunk types each use a distinctive type of delimiter: text items are delimited by a single text string, lines are delimited by any of a list of text strings, and words are delimited by any number and combination of characters from a set of characters.

Characters

The simplest type of chunk is the character chunk. A character is simply one character of text, including both visible and invisible characters (invisible characters include control characters such as tab, carriage return, and linefeed characters). The word character may be abbreviated as char.

put "The quick brown fox" into animal

put character 1 of animal --> T

put the last char of animal --> x

put chars 3 to 7 of animal --> e qui

Words

A single word is defined as a sequence of characters not containing any whitespace characters, or a sequence of characters contained in quotation marks. A range of words includes all characters from the first word specified through the last word specified, including all intervening words and whitespace. Whitespace characters are spaces, tabs, and returns (newlines).

put "Sometimes you feel like a nut; sometimes you don’t." into slogan

put the second word of slogan --> you

put word 6 of slogan --> nut;

put words 1 to 3 of slogan --> Sometimes you feel

Note that quoted phrases are ordinarily treated as a single word, including the quotation marks:

put <<Mary said "Good day" to John.>> into sentence

put the third word of sentence --> "Good day"

Related Local and Global Properties

SenseTalk includes local and global properties you can use to govern aspects of working with words in chunks. The set of characters that are used to identify words can be changed to something other than Space, Tab, and Return by setting the wordDelimiter local property or the defaultWordDelimiter global property. The quote characters used to identify a quoted word (or whether word quoting should be disabled completely) can be specified with the wordQuotes local property or the defaultWordQuotes global property.

These properties are defined in Local and Global Properties for Chunk Expressions.
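As a rough sketch of how a changed word delimiter behaves, the lines below set the wordDelimiter local property to two characters and then ask for a word. The sample text and the choice of delimiter characters are made up for illustration. The sketch assumes the property is set with the same set the ... to form used for other properties on this page.

```sensetalk
-- Treat hyphens and slashes (in any number and combination) as word breaks
set the wordDelimiter to "-/"

put word 2 of "2021-08/19" --> 08
```

Because the wordDelimiter is a set of characters rather than a single string, either character on its own ends a word.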

Lines

A line chunk expression allows you to specify one or more lines or paragraphs of text within the subject text, where lines are initially defined as the characters between any of the standard line ending characters.

put "line 1" & return & "line 2" & return & "line 3" into text

put the second line of text --> line 2

put line 6 of text --> ""

put lines 2 to 3 of text --> line 2 & return & line 3

Related Local and Global Properties

SenseTalk includes two properties you can use to govern aspects of working with lines in chunks. The set of line endings (delimiter strings) that define what a line is can be changed to something other than the default by setting the lineDelimiter local property. Setting the lineDelimiter to empty causes it to return to the default list.

The defaultLineDelimiter global property defines the default set of line delimiters. This property is initially set to: CRLF, Return, CarriageReturn, LineSeparator, ParagraphSeparator.

These properties are defined in Local and Global Properties for Chunk Expressions.
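The following sketch shows one way the lineDelimiter might be changed and then returned to its default. The variable name semicolonText and its contents are hypothetical, and the delimiter is given as a one-item list on the assumption that the property accepts a list of delimiter strings, as described above.

```sensetalk
put "one;two;three" into semicolonText

-- Use a semicolon as the only line ending
set the lineDelimiter to [";"]
put line 2 of semicolonText --> two

-- Setting the property to empty restores the default list of line endings
set the lineDelimiter to empty
```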

Text Items

An item within text is usually defined as the portion of text between commas:

put "A man, a plan, a canal. Panama!" into palindrome

put item 2 of palindrome --> " a plan"

The separation (delimiter) character can be specified as something other than a comma by setting the itemDelimiter property. The itemDelimiter's default value is determined by the defaultItemDelimiter global property. These two properties are defined in Local and Global Properties for Chunk Expressions.
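A minimal sketch of changing the itemDelimiter is shown here. The variable name isoDate and its value are made up for the example, and the final line assumes the default delimiter is the usual comma.

```sensetalk
put "2021-08-19" into isoDate

-- Split on hyphens instead of commas
set the itemDelimiter to "-"
put item 2 of isoDate --> 08

-- Put the delimiter back (assuming the default is a comma)
set the itemDelimiter to ","
```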

List Items

The word items can also refer to the elements in a list.

put ["red", "green", "blue"] into colors

put item 2 of colors -- green

SenseTalk decides whether item refers to text items or list items depending on whether the value is a list or not. When referring to items within a value that is a list, SenseTalk automatically assumes the reference is to list items, not text items. For any other value, items refers to text items, unless the itemDelimiter is set to “” (empty), in which case items refers to list items instead. You may explicitly refer to list items or text items instead of the more generic items if you need to control the way items are treated. This is especially important if you are trying to create a list by putting values into individual items, like this:

put 1 into myText -- 1

put 2 into item 2 of myText

put myText -- 1,2

The code above will generate a text string, with the middle character being the itemDelimiter (unless the itemDelimiter has been set to empty). To generate a list instead of text, specify list item:

put 1 into myList -- 1

put 2 into list item 2 of myList

put myList -- [1,2]

See Lists and Property Lists for more information on working with lists.

Bytes

A byte chunk can be used to refer to a portion of binary data.

set the defaultDataFormat to "auto"

put <3f924618> into binaryData

put byte 2 of binaryData -- <92>

See Binary Data Manipulation for more information on byte chunks.

Occurrences

The words occurrence and occurrences let you access a pattern match as chunks of a string and return the matched text. Use occurrence to access a specific single occurrence of a pattern within a string, and use occurrences to return a list of occurrences.

You can use instance as a synonym for occurrence, and instances in place of occurrences in all cases.

Example:

put occurrence 4 of <digit> in "V2.7 for 4/3/18" --> 3

Example:

set proverb to "If wishes were horses, beggars would ride"

set wordEndingWithS to <start of word, word chars, word ending with "s">

put occurrence 2 of wordEndingWithS in proverb -- horses

Requesting a range of occurrences returns a list of values rather than a substring of the source string.

Example:

This example reuses the proverb and wordEndingWithS variables set in the previous example:

put instances 1 to 3 of wordEndingWithS in proverb -- [wishes,horses,beggars]

Example:

put instances 3 to 5 of <digit> in "V2.7 for 4/3/18" -- [4,3,1]

put the first 3 occurrences of <max digits> in "42-16gh9-88" -- [42,16,9]

For information about using patterns, see SenseTalk Pattern Language Basics.

Matches

The words match and matches let you access pattern matches as chunks of a string. The value returned for a single match is a property list that contains the full text of the match as the text property and the range of the match as the text_range property.

Requesting a range by using matches returns a list of property lists, which includes one property list for each match of the pattern.

put the second match of <3 digits> in "987654321" -- {text:"654", text_range:"4" to "6"}

put the last 2 matches of <max digits> in "42-16gh9-88" -- [{text:"9", text_range:"8" to "8"},{text:"88", text_range:"10" to "11"}]

For information about using patterns, see SenseTalk Pattern Language Basics.

Custom Chunks

The standard word, line, and text item chunks are useful for many things just as they are. Sometimes you may have text in specific formats that you would like to divide in other ways, however. For example, many programs can produce data files containing several values separated by tab characters on each line of the file.

One way to work with such data would be to set the itemDelimiter to tab and then access the items of each line. But suppose that each tab-separated item contains several values separated by commas. To access these values individually would require switching the itemDelimiter back and forth between tab and comma.

SenseTalk offers an easier alternative for such cases: specify the delimiter as part of each chunk, using the phrase delimited by:

add 1 to item 3 delimited by "," of item 5 delimited by tab \
of line 18 of file complexDataFile
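As a small self-contained illustration of nested delimited by chunks, the sketch below uses an inline value in place of a file. The variable name dataLine and its contents are made up for the example.

```sensetalk
-- One tab-separated record whose second field holds comma-separated values
put "id-1" & tab & "3,5,8" & tab & "ok" into dataLine

put item 2 delimited by tab of dataLine --> 3,5,8

-- Reach inside the second tab-separated field without touching the itemDelimiter
put item 3 delimited by "," of item 2 delimited by tab of dataLine --> 8
```

Each delimited by phrase applies only to the chunk it follows, so the two delimiters can be mixed in one expression.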

The same syntax may be used with line chunks if you like:

get line 6 delimited by creturn of oddLineBreakText

The delimiters used to separate text items and lines are not restricted to a single character:

put item 2 delimited by "<>" of "12<>A19<>X" -- A19

Custom delimiters are also allowed with word chunks, but the behavior is different from that of items and lines. Words are normally separated by spaces, tabs, and line breaks. Any number of these “whitespace” characters may appear in sequence between two words. If you specify a custom delimiter for a word chunk, the “words” will be delimited by any number and combination of the characters contained in the delimiter string you supply:

put word 2 delimited by "<>" of "12><<>>A19><>X" -- A19

The following example illustrates the difference between custom delimiters for line chunks (which treat each delimiter string found as a separate break, so adjacent delimiters produce empty chunks) and for word chunks (which treat each sequence of delimiter characters as a single word break):

put each line delimited by ["<",">"] of "12><<>>A19><>X" -- ["12","","","","","A19","","","X"]

put each word delimited by "<>" of "12><<>>A19><>X" -- ["12","A19","X"]

This topic was last updated on August 19, 2021, at 03:30:51 PM.
