Performance

  • Published on
    Lately, I have been working on a high-throughput API where most reads are served from Redis. Redis is already fast, but my goal was to understand how low I could push **p99 latency** and, more importantly, how stable I could make the tail. This post is about end-to-end p99 tuning for a Redis-backed API. Some optimizations target Redis directly; others target the client, request shape, and network behavior. While I use Redis here, the optimizations to the network stack, request shape, and client-side caching apply to almost any distributed system (a short sketch of measuring p99 follows after this list).
  • Published on
    When storing data in memory, the data type used to represent it affects both the memory usage and the performance of the overall system. Consider saving a number. At a high level, it can be either an integer (a whole number) or a floating-point number (a number with a fractional part). Floating-point numbers can represent a larger range of values with higher precision. Weights and biases in a large language model, which are learned during training and used to make predictions, are stored as floating-point numbers to maintain that precision. The number of these parameters, together with the data type used to store each one, determines the model's size, its memory usage, and how much computational resource is needed to run it. In this post, we will discuss how quantization can be used to reduce the memory usage of models and improve performance (assuming the loss of precision is acceptable); a minimal sketch follows below.
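To make the memory argument in the second post concrete, here is a minimal sketch of per-tensor symmetric int8 quantization using NumPy. The array, its shape, and the scale choice are illustrative assumptions, not code from the post:

```python
# Minimal sketch: symmetric per-tensor int8 quantization of fp32 weights.
# The weight tensor below is a stand-in, not real model parameters.
import numpy as np

# Pretend these are the fp32 weights of one layer.
weights = np.random.randn(1024, 1024).astype(np.float32)

# One scale for the whole tensor: map the largest magnitude to 127.
scale = np.abs(weights).max() / 127.0

# Quantize: scale, round to the nearest integer, clip to the int8 range.
q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)

# Dequantize to get an fp32 approximation of the original values.
deq = q.astype(np.float32) * scale

print(f"fp32 size: {weights.nbytes / 1e6:.1f} MB")   # ~4.2 MB
print(f"int8 size: {q.nbytes / 1e6:.1f} MB")         # ~1.0 MB
print(f"max abs error: {np.abs(weights - deq).max():.5f}")
```

Storing each weight in one byte instead of four is where the roughly 4x memory saving comes from; the max-abs-error line shows the precision given up in exchange.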
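For the first post above, here is a rough sketch of what measuring p99 for a Redis-backed read path can look like. It assumes a local Redis instance, the redis-py client, and a placeholder key name:

```python
# Minimal sketch: measure p50/p99 latency of a hot Redis read path.
# Assumes redis-py is installed and a Redis server is listening locally.
import time
import numpy as np
import redis

r = redis.Redis(host="localhost", port=6379)

latencies_ms = []
for _ in range(10_000):
    start = time.perf_counter()
    r.get("some:key")  # the read being measured; key name is a placeholder
    latencies_ms.append((time.perf_counter() - start) * 1e3)

# p99 is the latency that 99% of requests stay under; the tail lives above it.
p50, p99 = np.percentile(latencies_ms, [50, 99])
print(f"p50={p50:.2f} ms  p99={p99:.2f} ms")
```

The gap between p50 and p99 is what that post's tuning is about: pulling the p99 number down and keeping it stable rather than only improving the median.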