# Tall-Scale Matrix Factorization on TPUs

Matrix factorization is one of many oldest, yet quiet widely archaic, ways for learning tricks on how to imply objects equivalent to songs or motion photographs from user ratings. In its classic catch, it approximates a tidy, sparse (i.e., largely empty) matrix of user-merchandise interactions with a fabricated from two smaller, denser matrices representing learned merchandise and user substances. These dense matrices, in flip, can be archaic to imply objects to a user with which they have not interacted sooner than.

No matter its algorithmic simplicity, matrix factorization can quiet raise out aggressive efficiency in recommender benchmarks. Alternating least squares (ALS), and particularly its implicit variation, is a chief algorithm to study the parameters of matrix factorization. ALS is identified for its excessive effectivity due to it scales linearly in the need of rows, columns and non-zeros. Hence, this algorithm is amazingly effectively matched for tidy-scale challenges. But, for extraordinarily tidy real-world matrix factorization datasets, a single machine implementation would now not suffice, and so, it might possibly possibly possibly possibly require a tidy disbursed machine. Many of the disbursed implementations of matrix factorization that use ALS leverage off-the-shelf CPU devices, and rightfully so, resulting from the inherently sparse nature of the wretchedness (the input matrix is largely empty).

On the opposite hand, contemporary success of deep learning, which has exhibited rising computational capability, has spurred a novel wave of compare and progress on hardware accelerators equivalent to Tensor Processing Gadgets (TPUs). TPUs just like the funds for domain particular hardware speedups, particularly for use conditions like deep learning, which involves a tidy need of dense matrix multiplications. In particular, they permit significant speedups for aged recordsdata-parallel workloads, equivalent to practicing fashions with Stochastic Gradient Descent (SGD) in SPMD (single program multiple recordsdata) vogue. The SPMD scheme has won recognition in computations like practicing neural networks with gradient descent algorithms, and could unbiased be archaic for both recordsdata-parallel and mannequin-parallel computations, the place we distribute parameters of the mannequin across on hand devices. Nonetheless, whereas TPUs like been tremendously lovely for recommendations based fully on SGD, it is now not straight away decided if a excessive efficiency implementation of ALS, which requires a tidy need of disbursed *sparse* matrix multiplies, can be developed for a tidy-scale cluster of TPU devices.

In “ALX: Tall Scale Matrix Factorization on TPUs”, we explore a disbursed ALS produce that makes environment friendly use of the TPU structure and could scale effectively to matrix factorization concerns of the inform of billions of rows and columns by scaling the need of on hand TPU cores. The style we imply leverages a mix of mannequin and recordsdata parallelism, the place each TPU core both shops a fragment of the embedding desk and trains over a particular nick of recordsdata, grouped in mini-batches. In confide in spur future compare on tidy-scale matrix factorization recommendations and as an instance the scalability properties of our have faith implementation, we also built and released an valid world net link prediction dataset known as WebGraph.

**Dense Batching for Improved Effectivity**

We designed ALX specifically for TPUs, exploiting queer properties of TPU structure whereas overcoming a pair of attractive obstacles. For occasion, each TPU core has itsy-bitsy memory and restricts all tensors to love a static form, but each example in a mini-batch can like a wildly diversified need of objects (i.e., inputs can be long and sparse). To resolve this, we break exceedingly long examples into multiple smaller examples of the the same form, a route of known as *dense batching*. Extra significant components about dense batching can be sign in our paper.

Illustrating example of how sparse batches are densified to magnify effectivity on TPUs. |

**Uniform Sharding of Embedding Tables**

With the batching wretchedness solved, we subsequent are making an strive to factorize a sparse matrix into two dense embedding matrices (e.g., user and merchandise embeddings) such that the resulting dot product of embeddings approximate the brand new sparse matrix — this helps us infer predictions for *all* the positions from the brand new matrix, including these that had been empty, which can be archaic to imply objects with which users haven’t interacted. Each the resulting embedding tables (W and H in the figure below) can potentially be too tidy to slot in a single TPU core, thus requiring a disbursed practicing setup for many tidy-scale use conditions.

Most aged attempts of disbursed matrix factorization use a parameter server structure the place the mannequin parameters are saved on highly on hand servers, and the practicing recordsdata is processed in parallel by employees which could possibly possibly be fully up to the mark of the educational activity. In our case, since each TPU core has the same compute and memory, it be wasteful to excellent use either memory for storing mannequin parameters or compute for practicing. Thus, we designed our machine such that each core is archaic to raise out both.

Illustrative example of factorizing a sparse matrix Y into two dense embedding matrices W and H. |

In ALX, we uniformly divide both embedding tables, thus fully exploiting both the size of disbursed memory on hand and the devoted low-latency interconnects between TPUs. This is extremely environment friendly for extraordinarily tidy embedding tables and ends in appropriate efficiency for disbursed catch and scatter operations.

Uniform sharding of both embedding tables (W and H) across TPU cores (in blue). |

**WebGraph**

Since attainable applications could unbiased like very tidy recordsdata devices, scalability is potentially an significant replacement for pattern in matrix factorization. To that give up, we also beginning a tidy real-world net link prediction dataset known as WebGraph. This dataset can be without problems modeled as a matrix factorization wretchedness the place rows and columns are offer and destination hyperlinks, respectively, and the duty is to predict destination hyperlinks from each offer link. We use WebGraph as an instance the scaling properties of ALX.

The WebGraph dataset change into as soon as generated from a single scoot performed by CommonCrawl in 2021 the place we strip the entirety and withhold excellent the link->outlinks recordsdata. For the reason that efficiency of a factorization scheme is dependent upon the properties of the underlying graph, we created six variations of WebGraph, each diversified in the sparsity sample and locale, to question how effectively ALS performs on each.

- To query locale-particular graphs, we filter based fully on two high stage domains: ‘de’ and ‘in’, each producing a graph with an inform of magnitude fewer nodes.
- These graphs can quiet like arbitrary sparsity patterns and dangling hyperlinks. Thus we extra filter the nodes in each graph to love at the very least either 10 or 50 inlinks and outlinks.

For easy catch admission to, we like got made these on hand as a Tensorflow Dataset equipment. For reference, the excellent version, WebGraph-sparse, has larger than 365M nodes and 30B edges. We catch and publish both practicing and testing splits for overview capabilities.

**Outcomes**

We carefully tune the machine and quality parameters of ALX. Based mostly fully mostly on our observations linked to precision and need of linear solvers. We noticed that by carefully selecting the precision for storage of the embedding tables (bfloat16) and for the input to the linear solvers (lumber with the drift32), we had been able to halve the memory required for the embeddings whereas quiet avoiding concerns arising from decrease precision values in the middle of the solve stage. For our linear solvers, we chosen conjugate gradients, which we stumbled on to be the quickest across the board on TPUs. We use embeddings of dimension 128 and prepare the mannequin for 16 epochs. In our abilities, hyperparameter tuning over both norm penalty (λ) and unobserved weight (α) has been vital for appropriate make a selection metrics as proven in the desk below.

Outcomes bought by working ALX on all variations of WebGraph dataset. Clutch values of 1.0 denote ideal make a selection. |

**Scaling Diagnosis**

For the reason that input recordsdata are processed in parallel across TPU cores, increasing the need of cores decreases practicing time, ideally in a linear vogue. But at the the same time, a bigger need of cores requires more network verbal change (resulting from the sharded embedding tables). On account of excessive-shuffle interconnects, this overhead can be negligible for a itsy-bitsy need of cores, but because the need of cores will enhance, the overhead in the kill slows down the correct linear scaling.

In confide in substantiate our hypothesis, we analyze scaling properties of the four excellent WebGraph variants in phrases of practicing time as we magnify the need of on hand TPU cores. As proven below, even empirically, we stock out peek the anticipated linear decrease in practicing time up to a sweet dwelling, after which the network overhead slows the decline.

Scaling diagnosis of working time because the need of TPU cores are increased. Each figure plots the time taken to prepare for one epoch in seconds. |

**Conclusion**

For easy catch admission to and reproducibility, the ALX code is originate-sourced and could unbiased be without problems shuffle on Google Cloud. In level of truth, we illustrate that a sparse matrix like WebGraph-dense of dimension 135M x 135M (with 22B edges) can be factorized in a colab linked to eight TPU cores in much less than a day. We now like designed the ALX framework with scalability in mind. With 256 TPU cores, one epoch of the excellent WebGraph variant, WebGraph-sparse (365M x 365M sparse matrix) takes around 20 minutes to enact (5.5 hours for your entire practicing shuffle). The final mannequin has around 100B parameters. We hope that the ALX and WebGraph can be useful to both researchers and practitioners working in these fields. The code for ALX can be stumbled on right here on github!

**Acknowledgements***The core crew involves Steffen Rendle, Walid Krichene and Li Zhang. We thank many Google colleagues for serving to at diversified phases of this project. In particular, we’re grateful to the JAX crew for diverse discussions, particularly James Bradbury and Skye Wanderman-Milne; Blake Hechtman for befriend with XLA and Rasmus Larsen for useful discussions about efficiency of linear solvers on TPUs. Finally, we’re also grateful to Nicolas Mayoraz, John Anderson, and Fernando Pereira for offering useful suggestions.*