Speaker
Johannes Gäßler
(Karlsruhe Institute of Technology)
Description
Large language models have, as the name implies, a large number of parameters. As such, not only the training costs but also the inference costs of these models are substantial. One strategy for reducing inference costs is to quantize the model weights from 16-bit floating-point values to a format with 2-8 bits per weight. However, these custom data formats in turn require custom inference code. This talk describes the interplay between llama.cpp quantization formats and inference code, and how int8 tensor cores or integer intrinsics can be used to reach performance exceeding that of the standard floating-point GEMM routines provided by e.g. cuBLAS.
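For illustration, here is a minimal sketch of what a block-quantized 8-bit format and a dot product using the CUDA __dp4a integer intrinsic might look like. The names block_q8 and vec_dot_q8 and the use of a float scale are assumptions for this sketch; the actual llama.cpp formats (e.g. Q8_0, which stores an fp16 scale) and kernels differ in detail.

```cuda
// Minimal sketch (not the actual llama.cpp implementation): an 8-bit block
// quantization format and a dot product computed with the __dp4a integer
// intrinsic instead of floating-point FMAs. Requires compute capability >= 6.1.
#include <cstdint>

#define QK8 32  // number of quantized weights per block

struct block_q8 {          // hypothetical format; llama.cpp's Q8_0 uses an fp16 scale
    float  d;              // per-block scale factor
    int8_t qs[QK8];        // quantized weights, dequantized as d * qs[i]
};

// Dot product of two quantized blocks:
// sum_i (dx*qx[i]) * (dy*qy[i]) = dx*dy * sum_i qx[i]*qy[i],
// where the integer sum is accumulated 4 values at a time with __dp4a.
__device__ float vec_dot_q8(const block_q8 * x, const block_q8 * y) {
    int sumi = 0;
    const int * qx = (const int *) x->qs;  // 4 packed int8 values per int32
    const int * qy = (const int *) y->qs;
#pragma unroll
    for (int i = 0; i < QK8/4; ++i) {
        sumi = __dp4a(qx[i], qy[i], sumi);  // 4 int8 multiply-adds per instruction
    }
    return x->d * y->d * sumi;
}
```

The point of the integer path is that the 32 int8 multiply-accumulates of a block collapse into 8 integer instructions, with only a single floating-point multiplication per block pair to apply the scales.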
Primary author
Johannes Gäßler
(Karlsruhe Institute of Technology)