# splitlabels

Find indices to split labels according to specified proportions

## Syntax

`idxs = splitlabels(lblsrc,p)`

`idxs = splitlabels(lblsrc,p,'randomized')`

`idxs = splitlabels(___,Name,Value)`

## Description

Use this function when you are working on a machine or deep learning classification problem and you want to split a dataset into training, testing, and validation sets that hold the same proportion of label values.

`idxs = splitlabels(___,Name,Value)` specifies additional input arguments using name-value pairs. For example, `'UnderlyingDatastoreIndex',3` splits the labels only in the third underlying datastore of a combined datastore.

## Examples

### Split Vowels

Read William Shakespeare's sonnets with the `fileread` function. Extract all the vowels from the text and convert them to lowercase.

```
sonnets = fileread("sonnets.txt");
vowels = lower(sonnets(regexp(sonnets,"[AEIOUaeiou]")))';
```

Count the number of instances of each vowel.

```
cnts = countlabels(vowels)
```

```
cnts=5×3 table
    Label    Count    Percent
    _____    _____    _______
      a      4940     18.368
      e      9028     33.569
      i      4895     18.201
      o      5710     21.232
      u      2321     8.6302
```

Split the vowels into a training set containing 500 instances of each vowel, a validation set containing 300, and a testing set with the rest. All vowels are represented with equal weights in the first two sets but not in the third.

```
spltn = splitlabels(vowels,[500 300]);

for kj = 1:length(spltn)
    cntsn{kj} = countlabels(vowels(spltn{kj}));
end
cntsn{:}
```

```
ans=5×3 table
    Label    Count    Percent
    _____    _____    _______
      a       500       20
      e       500       20
      i       500       20
      o       500       20
      u       500       20

ans=5×3 table
    Label    Count    Percent
    _____    _____    _______
      a       300       20
      e       300       20
      i       300       20
      o       300       20
      u       300       20

ans=5×3 table
    Label    Count    Percent
    _____    _____    _______
      a      4140     18.083
      e      8228     35.94
      i      4095     17.887
      o      4910     21.447
      u      1521     6.6437
```

Split the vowels into a training set containing 50% of the instances, a validation set containing another 30%, and a testing set with the rest. All vowels are represented with the same weight across all three sets.

```
spltp = splitlabels(vowels,[0.5 0.3]);

for kj = 1:length(spltp)
    cntsp{kj} = countlabels(vowels(spltp{kj}));
end
cntsp{:}
```

```
ans=5×3 table
    Label    Count    Percent
    _____    _____    _______
      a      2470     18.367
      e      4514     33.566
      i      2448     18.203
      o      2855     21.23
      u      1161     8.6333

ans=5×3 table
    Label    Count    Percent
    _____    _____    _______
      a      1482     18.371
      e      2708     33.569
      i      1468     18.198
      o      1713     21.235
      u       696     8.6277

ans=5×3 table
    Label    Count    Percent
    _____    _____    _______
      a       988     18.368
      e      1806     33.575
      i       979     18.2
      o      1142     21.231
      u       464     8.6261
```

### Split Vowels and Consonants

Read William Shakespeare's sonnets with the `fileread` function. Remove all nonalphabetic characters from the text and convert to lowercase.

```
sonnets = fileread("sonnets.txt");
% Note: the range [A-z] also matches the six ASCII characters between "Z" and "a".
letters = lower(sonnets(regexp(sonnets,"[A-z]")))';
```

Classify the letters as consonants or vowels and create a table with the results. Show the first few rows of the table.

```
type = repmat("consonant",size(letters));
type(regexp(letters',"[aeiou]")) = "vowel";

T = table(letters,type,'VariableNames',["Letter" "Type"]);
head(T)
```

```
    Letter       Type
    ______    ___________
      t       "consonant"
      h       "consonant"
      e       "vowel"
      s       "consonant"
      o       "vowel"
      n       "consonant"
      n       "consonant"
      e       "vowel"
```

Display the number of instances of each category.

```
cnt = countlabels(T,'TableVariable',"Type")
```

```
cnt=2×3 table
      Type       Count    Percent
    _________    _____    _______
    consonant    46516    63.365
    vowel        26894    36.635
```

Split the table into two sets, one containing 60% of the consonants and vowels and the other containing 40%. Display the number of instances of each category.

```
splt = splitlabels(T,0.6,'TableVariable',"Type");

sixty = countlabels(T(splt{1},:),'TableVariable',"Type")
```

```
sixty=2×3 table
      Type       Count    Percent
    _________    _____    _______
    consonant    27910    63.366
    vowel        16136    36.634
```

```
forty = countlabels(T(splt{2},:),'TableVariable',"Type")
```

```
forty=2×3 table
      Type       Count    Percent
    _________    _____    _______
    consonant    18606    63.363
    vowel        10758    36.637
```

Split the table into two sets, one containing 60% of each particular letter and the other containing 40%. Exclude the letter *y*, which sometimes acts as a consonant and sometimes as a vowel. Display the number of instances of each category.

```
splt = splitlabels(T,0.6,'Exclude',"y");

sixti = countlabels(T(splt{1},:),'TableVariable',"Type")
```

```
sixti=2×3 table
      Type       Count    Percent
    _________    _____    _______
    consonant    26719    62.346
    vowel        16137    37.654
```

```
forti = countlabels(T(splt{2},:),'TableVariable',"Type")
```

```
forti=2×3 table
      Type       Count    Percent
    _________    _____    _______
    consonant    17813    62.349
    vowel        10757    37.651
```

Split the table into two sets of the same size. Include only the letters *e* and *s*. Randomize the sets.

```
halves = splitlabels(T,0.5,'randomized','Include',["e" "s"]);

cnt = countlabels(T(halves{1},:))
```

```
cnt=2×3 table
    Letter    Count    Percent
    ______    _____    _______
      e       4514     64.385
      s       2497     35.615
```

### Split Data in Datastore

Create a dataset that consists of 100 Gaussian random numbers. Label 40 of the numbers as `A`, 30 as `B`, and 30 as `C`. Store the data in a combined datastore containing two datastores. The first datastore has the data and the second datastore contains the labels.

```
dsData = arrayDatastore(randn(100,1));
dsLabels = arrayDatastore([repmat("A",40,1); repmat("B",30,1); repmat("C",30,1)]);
dsDataset = combine(dsData,dsLabels);

cnt = countlabels(dsDataset,'UnderlyingDatastoreIndex',2)
```

```
cnt=3×3 table
    Label    Count    Percent
    _____    _____    _______
      A       40        40
      B       30        30
      C       30        30
```

Split the data set into two sets, one containing 60% of the numbers and the other with the rest.

```
splitIndices = splitlabels(dsDataset,0.6,'UnderlyingDatastoreIndex',2);

dsDataset1 = subset(dsDataset,splitIndices{1});
cnt1 = countlabels(dsDataset1,'UnderlyingDatastoreIndex',2)
```

```
cnt1=3×3 table
    Label    Count    Percent
    _____    _____    _______
      A       24        40
      B       18        30
      C       18        30
```

```
dsDataset2 = subset(dsDataset,splitIndices{2});
cnt2 = countlabels(dsDataset2,'UnderlyingDatastoreIndex',2)
```

```
cnt2=3×3 table
    Label    Count    Percent
    _____    _____    _______
      A       16        40
      B       12        30
      C       12        30
```

## Input Arguments

`lblsrc`

— Input label source

categorical vector | string vector | logical vector | numeric vector | cell array | table | datastore | `CombinedDatastore` object

Input label source, specified as one of these:

- A categorical vector.
- A string vector or a cell array of character vectors.
- A numeric vector or a cell array of numeric scalars.
- A logical vector or a cell array of logical scalars.
- A table with variables containing any of the previous data types.
- A datastore whose `readall` function returns any of the previous data types.
- A `CombinedDatastore` object containing an underlying datastore whose `readall` function returns any of the previous data types. In this case, you must specify the index of the underlying datastore that has the label values.

`lblsrc` must contain labels that can be converted to a vector with a discrete set of categories.

**Example:** `lblsrc = categorical(["B" "C" "A" "E" "B" "A" "A" "B" "C" "A"],["A" "B" "C" "D"])` creates the label source as a ten-sample categorical vector with four categories: `A`, `B`, `C`, and `D`.

**Example:** `lblsrc = [0 7 2 5 11 17 15 7 7 11]` creates the label source as a ten-sample numeric vector.

**Data Types:** `single` | `double` | `int8` | `int16` | `int32` | `int64` | `uint8` | `uint16` | `uint32` | `uint64` | `logical` | `char` | `string` | `table` | `cell` | `categorical`

`p`

— Proportions or numbers of labels

integer scalar | scalar in (0, 1) | vector of integers | vector of fractions

Proportions or numbers of labels, specified as an integer scalar, a scalar in the range (0, 1), a vector of integers, or a vector of fractions.

- If `p` is a scalar, `splitlabels` finds two splitting index sets and returns a two-element cell array in `idxs`.

  - If `p` is an integer, the first element of `idxs` contains a vector of indices pointing to the first `p` values of each label category. The second element of `idxs` contains indices pointing to the remaining values of each label category.

  - If `p` is a value in the range (0, 1) and `lblsrc` has $K_i$ elements in the $i$th category, the first element of `idxs` contains a vector of indices pointing to the first $p \times K_i$ values of each label category. The second element of `idxs` contains the indices of the remaining values of each label category.

- If `p` is a vector with $N$ elements of the form $p_1, p_2, \ldots, p_N$, `splitlabels` finds $N + 1$ splitting index sets and returns an $(N + 1)$-element cell array in `idxs`.

  - If `p` is a vector of integers, the first element of `idxs` is a vector of indices pointing to the first $p_1$ values of each label category, the next element of `idxs` contains the next $p_2$ values of each label category, and so on. The last element in `idxs` contains the remaining indices of each label category.

  - If `p` is a vector of fractions and `lblsrc` has $K_i$ elements of the $i$th category, the first element of `idxs` is a vector of indices concatenating the first $p_1 \times K_i$ values of each category, the next element of `idxs` contains the next $p_2 \times K_i$ values of each label category, and so on. The last element in `idxs` contains the remaining indices of each label category.
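The integer-vector case described above can be made concrete with a few lines of base MATLAB. This is only a sketch of the documented behavior, not the toolbox implementation: the data and the names `lbls`, `cats`, `edges`, and `sets` are illustrative, and the order of the indices inside each set (grouped by category here) is an assumption.

```
% Toy label vector with three categories (illustrative data).
lbls = ["A";"B";"A";"C";"B";"A";"C";"B";"A";"C"];
p = [2 1];                         % take 2, then 1, labels per category

cats  = unique(lbls);              % label categories, sorted
edges = [0 cumsum(p)];             % cumulative boundaries between sets
sets  = cell(1,numel(p)+1);        % N+1 index sets, initially empty

for c = 1:numel(cats)
    loc = find(lbls == cats(c));   % positions of this category, in order
    for s = 1:numel(p)
        % The "next p(s)" occurrences of this category go into set s.
        sets{s} = [sets{s}; loc(edges(s)+1:edges(s+1))];
    end
    % Whatever is left of this category goes into the last set.
    sets{end} = [sets{end}; loc(edges(end)+1:end)];
end
```

With this toy input, `sets{1}` holds two indices per category, `sets{2}` holds one per category, and `sets{3}` holds the single leftover `"A"` at position 9.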

**Note**

- If `p` contains fractions, then the sum of its elements must not be greater than one.
- If `p` contains numbers of label values, then the sum of its elements must not be greater than the smallest number of labels available for any of the label categories.
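The two constraints in the note can be checked before calling `splitlabels`. The following base MATLAB sketch is one possible pre-check, not something the function requires; the names `counts` and `ok` are illustrative, and treating `p` as fractions whenever all of its elements lie in (0, 1) is an assumption made for this example.

```
% Toy label vector (illustrative data) and a candidate split.
lbls = ["A";"B";"A";"C";"B";"A";"C";"B";"A";"C"];
p = [2 1];

% Count the labels in each category with base MATLAB.
[~,~,grp] = unique(lbls);
counts = accumarray(grp,1);        % 4 A's, 3 B's, 3 C's

if all(p > 0 & p < 1)
    ok = sum(p) <= 1;              % fractions: total of at most one
else
    ok = sum(p) <= min(counts);    % counts: limited by the rarest category
end
```

Here `sum(p)` is 3 and the rarest category has 3 labels, so `ok` is `true`; a request such as `p = [2 2]` would fail the same check.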

**Data Types:** `single` | `double` | `int8` | `int16` | `int32` | `int64` | `uint8` | `uint16` | `uint32` | `uint64`

### Name-Value Arguments

Specify optional pairs of arguments as `Name1=Value1,...,NameN=ValueN`, where `Name` is the argument name and `Value` is the corresponding value. Name-value arguments must appear after other arguments, but the order of the pairs does not matter.

*Before R2021a, use commas to separate each name and value, and enclose* `Name` *in quotes.*

**Example:** `'TableVariable',"AreaCode",'Exclude',["617" "508"]` specifies that the function splits labels based on telephone area code and excludes numbers from Boston and Natick.

`Include`

— Labels to include in index sets

vector of label categories | cell array of label categories

Labels to include in the index sets, specified as a vector or cell array of label categories. The categories specified with this argument must be of the same type as the labels in `lblsrc`. Each category in the vector or cell array must match one of the label categories in `lblsrc`.

`Exclude`

— Labels to exclude from index sets

vector of label categories | cell array of label categories

Labels to exclude from the index sets, specified as a vector or cell array of label categories. The categories specified with this argument must be of the same type as the labels in `lblsrc`. Each category in the vector or cell array must match one of the label categories in `lblsrc`.
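As a small illustration of `Exclude`, the sketch below splits a toy string vector in half while leaving one category out. The data and the names `lbls`, `firstHalf`, and `secondHalf` are made up for this example; it assumes the Signal Processing Toolbox is installed.

```
% Illustrative string labels; "c" is the category to leave out.
lbls = ["a";"b";"c";"a";"b";"c";"a";"b";"a";"b"];

% Split the remaining categories in half, skipping every "c".
idxs = splitlabels(lbls,0.5,'Exclude',"c");

% Labels referenced by each index set.
firstHalf  = lbls(idxs{1});
secondHalf = lbls(idxs{2});
```

Neither `firstHalf` nor `secondHalf` should contain `"c"`. Replacing `'Exclude',"c"` with `'Include',["a" "b"]` is expected to select the same categories.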

`TableVariable`

— Table variable to read

first table variable (default) | character vector | string scalar

Table variable to read, specified as a character vector or string scalar. If this argument is not specified, then `splitlabels` uses the first table variable.

`UnderlyingDatastoreIndex`

— Underlying datastore index

integer scalar

Underlying datastore index, specified as an integer scalar. This argument applies when `lblsrc` is a `CombinedDatastore` object. `splitlabels` splits the labels in the datastore obtained using the `UnderlyingDatastores` property of `lblsrc`.

## Output Arguments

`idxs`

— Splitting indices

cell array

Splitting indices, returned as a cell array.
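
Each element of `idxs` can be used to index directly into the label source. The sketch below, on made-up data, pulls out three subsets and checks that together the index sets cover every sample exactly once. The names `trainLbls`, `valLbls`, `testLbls`, `allIdx`, and `covered` are illustrative, and the index vectors are reshaped to columns because their orientation is not stated here.

```
% Toy categorical labels: four "x" and four "y" (illustrative data).
lbls = categorical(["x";"y";"x";"y";"x";"y";"x";"y"]);

% Two fractions give three index sets: 50%, 25%, and the rest.
idxs = splitlabels(lbls,[0.5 0.25]);

trainLbls = lbls(idxs{1});
valLbls   = lbls(idxs{2});
testLbls  = lbls(idxs{3});

% Stack all index sets into one sorted column to check coverage.
allIdx = cellfun(@(v) v(:),idxs,'UniformOutput',false);
allIdx = sort(vertcat(allIdx{:}));
covered = isequal(allIdx,(1:numel(lbls))');
```

Because the last element of `idxs` holds the remaining indices of each category, `covered` is expected to be `true` when no `Include` or `Exclude` argument is used.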

## Version History

**Introduced in R2021a**

## See Also

`countlabels` (Signal Processing Toolbox) | `filenames2labels` (Signal Processing Toolbox) | `folders2labels` (Signal Processing Toolbox)
