# Tall Skinny QR (TSQR) Matrix Factorization Using MapReduce

This example shows how to compute a tall skinny QR (TSQR) factorization using `mapreduce`. It demonstrates how to chain `mapreduce` calls to perform multiple iterations of factorizations, and uses the `info` argument of the map function to compute numeric keys.

### Prepare Data

Create a datastore using the `airlinesmall.csv` data set. This 12-megabyte data set contains 29 columns of flight information for several airline carriers, including arrival and departure times. In this example, the variables of interest are `ArrDelay` (flight arrival delay), `DepDelay` (flight departure delay) and `Distance` (total flight distance).

```ds = tabularTextDatastore('airlinesmall.csv', 'TreatAsMissing', 'NA'); ds.ReadSize = 1000; ds.SelectedVariableNames = {'ArrDelay', 'DepDelay', 'Distance'};```

The datastore treats `'NA'` values as missing and replaces the missing values with `NaN` values by default. The `ReadSize` property lets you specify how to partition the data into blocks. Additionally, the `SelectedVariableNames` property allows you to work with only the specified variables of interest, which you can verify using `preview`.

`preview(ds)`
```ans=8×3 table ArrDelay DepDelay Distance ________ ________ ________ 8 12 308 8 1 296 21 20 480 13 12 296 4 -1 373 59 63 308 3 -2 447 11 -1 954 ```

### Chain MapReduce Calls

The implementation of the multi-iteration TSQR algorithm needs to chain consecutive `mapreduce` calls. To demonstrate the general chaining design pattern, this example uses two `mapreduce` iterations. The output from the map function calls is passed into a large set of reducers, and then the output of these reducers becomes the input for the next `mapreduce` iteration.

### First MapReduce Iteration

In the first iteration, the map function, `tsqrMapper`, receives one block (the ith) of data, which is a table of size ${N}_{i}×3$. The mapper computes the $R$ matrix of this block of data and stores it as an intermediate result. Then, `mapreduce` aggregates the intermediate results by unique key before sending them to the reduce function. Thus, `mapreduce` sends all intermediate $R$ matrices with the same key to the same reducer.

Since the reducer uses `qr`, which is an in-memory MATLAB® function, it's best to first make sure that the $R$ matrices fit in memory. This example divides the dataset into eight partitions. The `mapreduce` function reads the data in blocks and passes the data along with some meta information to the map function. The `info` input argument is the second input to the map function and it contains the read offset and file size information that are necessary to generate the key,

``` key = ceil(offset/fileSize/numPartitions). ```

Display the map function file.

```function tsqrMapper(data, info, intermKVStore) x = data{:,:}; x(any(isnan(x),2),:) = [];% Remove missing values [~, r] = qr(x,0); % intermKey = randi(4); % random integer key for partitioning intermediate results intermKey = computeKey(info, 8); add(intermKVStore,intermKey, r); function key = computeKey(info, numPartitions) fileSize = info.FileSize; % total size of the underlying data file partitionSize = fileSize/numPartitions; % size in bytes of each partition offset = info.Offset; % offset in bytes of the current read key = ceil(offset/partitionSize); end end ```

The reduce function receives a list of the intermediate $R$ matrices, vertically concatenates them, and computes the $R$ matrix of the concatenated matrix.

Display the reduce function file.

```function tsqrReducer(intermKey, intermValIter, outKVStore) x = []; while (intermValIter.hasnext) x = [x;intermValIter.getnext]; end % Note that this approach assumes the concatenated intermediate values fit % in memory. Consider increasing the number of reduce tasks (increasing the % number of partitions in the tsqrMapper) and adding more iterations if it % does not fit in memory. [~, r] =qr(x,0); add(outKVStore,intermKey,r); end ```

Use `mapreduce` to apply the map and reduce functions to the datastore, `ds`.

`outds1 = mapreduce(ds, @tsqrMapper, @tsqrReducer);`
```******************************** * MAPREDUCE PROGRESS * ******************************** Map 0% Reduce 0% Map 10% Reduce 0% Map 20% Reduce 0% Map 30% Reduce 0% Map 40% Reduce 0% Map 50% Reduce 0% Map 60% Reduce 0% Map 70% Reduce 0% Map 80% Reduce 0% Map 90% Reduce 0% Map 100% Reduce 0% Map 100% Reduce 11% Map 100% Reduce 22% Map 100% Reduce 33% Map 100% Reduce 44% Map 100% Reduce 56% Map 100% Reduce 67% Map 100% Reduce 78% Map 100% Reduce 89% Map 100% Reduce 100% ```

`mapreduce` returns an output datastore, `outds1`, with files in the current folder.

### Second MapReduce Iteration

The second iteration uses the output of the first iteration, `outds1`, as its input. This iteration uses an identity mapper, `identityMapper`, which simply copies over the data using a single key, `'Identity'`.

Display the identity mapper file.

```function identityMapper(data, info, intermKVStore) % This mapper function simply copies the data and add them to the % intermKVStore as intermediate values. x = data.Value{:,:}; add(intermKVStore,'Identity', x); end ```

The reducer function is the same in both iterations. The use of a single key by the map function means that `mapreduce` only calls the reduce function once in the second iteration.

Use `mapreduce` to apply the identity mapper and the same reducer to the output from the first `mapreduce` call.

`outds2 = mapreduce(outds1, @identityMapper, @tsqrReducer);`
```******************************** * MAPREDUCE PROGRESS * ******************************** Map 0% Reduce 0% Map 11% Reduce 0% Map 22% Reduce 0% Map 33% Reduce 0% Map 44% Reduce 0% Map 55% Reduce 0% Map 66% Reduce 0% Map 77% Reduce 0% Map 88% Reduce 0% Map 100% Reduce 0% Map 100% Reduce 100% ```

### View Results

Read the final results from the output datastore.

```r = readall(outds2); r.Value{:}```
```ans = 3×3 105 × 0.1091 0.0893 0.5564 0 -0.0478 -0.4890 0 0 3.0130 ```

### Local Functions

Listed here are the map and reduce functions that `mapreduce` applies to the data.

```function tsqrMapper(data, info, intermKVStore) x = data{:,:}; x(any(isnan(x),2),:) = [];% Remove missing values [~, r] = qr(x,0); % intermKey = randi(4); % random integer key for partitioning intermediate results intermKey = computeKey(info, 8); add(intermKVStore,intermKey, r); function key = computeKey(info, numPartitions) fileSize = info.FileSize; % total size of the underlying data file partitionSize = fileSize/numPartitions; % size in bytes of each partition offset = info.Offset; % offset in bytes of the current read key = ceil(offset/partitionSize); end end %------------------------------------------------------------------------------- function tsqrReducer(intermKey, intermValIter, outKVStore) x = []; while (intermValIter.hasnext) x = [x;intermValIter.getnext]; end % Note that this approach assumes the concatenated intermediate values fit % in memory. Consider increasing the number of reduce tasks (increasing the % number of partitions in the tsqrMapper) and adding more iterations if it % does not fit in memory. [~, r] =qr(x,0); add(outKVStore,intermKey,r); end %------------------------------------------------------------------------------- function identityMapper(data, info, intermKVStore) % This mapper function simply copies the data and add them to the % intermKVStore as intermediate values. x = data.Value{:,:}; add(intermKVStore,'Identity', x); end %-------------------------------------------------------------------------------```

### Reference

1. Paul G. Constantine and David F. Gleich. 2011. Tall and skinny QR factorizations in MapReduce architectures. In Proceedings of the Second International Workshop on MapReduce and Its Applications (MapReduce '11). ACM, New York, NY, USA, 43-50. DOI=10.1145/1996092.1996103 https://doi.acm.org/10.1145/1996092.1996103