You Don't Need the GPUs They're Selling You
How an AI Infrastructure That Wastes Most of What It Computes Got Built
The AI industry has invested over $100 billion in GPU infrastructure based on assumptions about computational requirements that may be fundamentally flawed.
This analysis examines structural inefficiencies in transformer architectures and GPU utilization patterns, drawing on academic research and open-source projects to show that memory bandwidth, not compute, is the primary bottleneck in large language model (LLM) inference. Projects such as llama.cpp (93,000+ stars), vLLM (50,000+ stars), ExLlamaV2, MLX, Petals, and AirLLM demonstrate that layer-wise inference, IO-aware attention, distributed computing, and CPU offloading are reshaping AI deployment economics. Gao et al. documented average GPU utilization of 50% or less across 400 real deep learning jobs on Microsoft’s internal platform, confirming that memory constraints dominate production workloads.
This pattern is not new. Every generation of computer science has faced the choice between smart engineering and brute-force spending. The implications extend beyond cost to questions of access, sustainability, and the fundamental architecture of AI systems.
The Hardware Lie We Tell Ourselves
So here is something that should bother everyone who has been writing checks for GPU infrastructure. A collection of GitHub repositories with a combined 150,000+ stars may have just exposed one of the most uncomfortable truths in AI. We have collectively spent over $100 billion building GPU infrastructure that operates at a fraction of its theoretical capacity. The projects are diverse (llama.cpp, vLLM, ExLlamaV2, MLX, Petals, AirLLM) but they share a premise that sounds almost absurd: you do not need the hardware you have been told you need.
Look at AirLLM — 70-billion parameter model on a 4GB GPU. Or llama.cpp running inference on consumer CPUs that were supposed to need data center hardware. Or Petals distributing model layers across volunteer computers over the internet, BitTorrent-style. The fact that any of this works at all should make everyone question the conventional wisdom about what large language models actually require.
Bloomberg reports that global AI infrastructure spending exceeded $150 billion in 2024, mostly flowing to GPUs and data centers. NVIDIA alone captured over $47 billion in data center revenue last fiscal year, gaining near-monopolistic control over the hardware powering modern AI. If even a fraction of that investment is based on false assumptions about hardware requirements, we are looking at tens of billions in misallocated capital. This might be the largest example of collective inefficiency in technology history.
And it is not just about money. The hardware paradigm has created barriers that concentrate AI capabilities among a few well-funded players. A researcher at a small university cannot afford GPT-4-scale experiments. A startup in an emerging market cannot compete with hyperscalers for GPU allocation. If these barriers are artificial, only existing because we optimized for the wrong constraints, then we have inadvertently built an oligopoly on computational intelligence. How many times do we need to learn this lesson?
The Oldest Problem in Computing
Before we get into GPU inefficiency specifically, here is a history lesson that keeps repeating and we keep forgetting: the tension between compute and memory has defined this field since vacuum tubes.
In 1946, the ENIAC could perform 5,000 additions per second but had only 20 words of internal memory. Engineers spent more time managing data movement than designing algorithms. The first stored-program computers of the late 1940s used mercury delay lines that could store a few thousand bits, and those bits had to be carefully orchestrated to keep the processor fed. The Williams tube, the first random-access memory, held perhaps 2,048 bits. Every generation of computer scientists has faced the same fundamental problem of processors that can compute faster than memory can deliver data.
Wulf and McKee formalized this observation in their seminal 1995 paper identifying the “memory wall,” which highlighted the growing disparity between processor speed and memory bandwidth [9]. They predicted that this gap would become the dominant constraint on system performance. They were right. In 1980, DRAM latency was roughly comparable to processor cycle time. By 2020, processor speeds had improved by roughly 10,000x while DRAM latency had improved by only 10x. The gap continues to widen.
Throughout this history, computer scientists have faced a choice. They can either engineer around the constraint or spend their way past it. The smart path involves careful algorithm design, cache-aware programming, memory hierarchy optimization, and data structure engineering. The brute-force path involves buying more hardware. Both approaches have their place, but history shows that the smart path often delivers order-of-magnitude improvements that no amount of spending can match.
Take the transition from bubble sort (O(n²)) to quicksort (O(n log n)). No amount of hardware improvement could have made bubble sort competitive at scale; the algorithmic change was the only way forward. Or consider the development of B-trees for database indexing, which transformed disk access patterns from linear scans to logarithmic searches. These reconceptualizations of the problem made previously intractable workloads practical.
The AI industry is facing the same choice, and it has overwhelmingly chosen brute force. When capital is abundant and competitive pressure is intense, throwing hardware at the problem is the fastest path to capability. But this approach creates technical debt. It establishes patterns that become expensive to unwind, and it may not even be sustainable. Silicon has physical limits, energy costs are real, and capital is not infinite. The spending path has a ceiling. We are hitting it.
The Memory Wall We Pretend Does Not Exist
There exists an inconvenient truth that the AI industry has papered over with hardware purchases.
Modern GPUs spend most of their time waiting for data, not computing.
The roofline model, introduced by Williams, Waterman, and Patterson at Lawrence Berkeley National Laboratory in 2009 [3], provides a framework for understanding this relationship. Every computation exists somewhere on a graph bounded by two constraints: peak computational throughput (FLOPS) and memory bandwidth (bytes per second). The “ridge point” where these lines intersect determines whether a workload is compute-bound or memory-bound.
For an NVIDIA H100 GPU, the current gold standard for AI training and inference, the ridge point occurs at an arithmetic intensity of roughly 300 operations per byte. This means that to keep the H100’s computational units fully utilized, a workload must perform 300 floating-point operations for every byte transferred from memory. Below this threshold, the GPU is memory-bound, with its computational cores idly waiting for data. The H100 can deliver nearly two petaflops of FP16 tensor computation, but this capacity is meaningless if data cannot arrive fast enough to keep it busy.
Most operations in large language model inference fall well below this threshold. For each new token, the model must compute attention scores against all previous tokens and multiply activations through billions of parameters. But the arithmetic intensity of these operations is low because the data — weights, activations, key-value caches — must be loaded fresh for each forward pass. The weights alone for a 70-billion parameter model at FP16 precision occupy 140 GB. For a single-token generation step with batch size one, the entire 140 GB must be moved to perform roughly two floating-point operations per weight (one multiply and one add). The arithmetic intensity works out to roughly one operation per byte, which is 300 times lower than what the GPU needs to achieve peak utilization.
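The back-of-the-envelope version of this calculation fits in a few lines (a sketch using the figures quoted above; real kernels and memory systems vary):

```python
# Roofline check for single-token, batch-1 decoding of a 70B-parameter
# model in FP16, using the figures quoted in the text.

params = 70e9            # model parameters
bytes_per_weight = 2     # FP16
flops_per_weight = 2     # one multiply + one add per weight per token

bytes_moved = params * bytes_per_weight   # ~140 GB per forward pass
flops = params * flops_per_weight         # ~140 GFLOP per token

intensity = flops / bytes_moved           # operations per byte
print(f"arithmetic intensity: {intensity:.1f} op/byte")   # -> 1.0

# H100 ridge point: ~989 TFLOPS dense FP16 over ~3.35 TB/s of HBM3
ridge = 989e12 / 3.35e12
print(f"H100 ridge point: {ridge:.0f} op/byte")           # ~295

# The workload sits ~300x below the ridge point: thoroughly memory-bound.
print(f"utilization ceiling: {intensity / ridge:.2%}")
```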
Gao et al. documented this in their 2024 ICSE paper examining 400 real deep learning jobs on Microsoft’s internal platform [2]. Average GPU utilization was 50% or less for production AI at one of the world’s most sophisticated tech companies. The paper won a distinguished paper award, not because the findings were controversial, but because nobody had bothered to systematically measure what everyone already suspected.
They identified 706 distinct low-utilization issues — basically a taxonomy of inefficiency that reads like an indictment of our entire approach to AI infrastructure. Data loading bottlenecks, suboptimal batch sizes, framework overhead, I/O contention. But underneath all of it existed the same problem. Nobody designed the software with memory hierarchy as a first-class constraint.
The Open-Source Efficiency Revolution
While the industry was scaling up hardware, a parallel movement was scaling down requirements. Open-source projects built by individual contributors and small teams have been proving the emperor has no clothes. Our own company has been running multi-billion parameter models on off-the-shelf laptops for a while now. Here is what is out there.
A. llama.cpp: 93,000 Stars and Counting
In March 2023, Georgi Gerganov released llama.cpp, a pure C/C++ implementation of LLaMA inference with no dependencies [10]. The premise was to run large language models efficiently on consumer hardware, including CPUs without any GPU acceleration. Within two years, it had accumulated over 93,000 GitHub stars, attracted 1,418 contributors, and become the foundation for dozens of downstream applications including Ollama, LM Studio, and GPT4All.
The technical work is solid. llama.cpp introduced aggressive quantization (1.5-bit to 8-bit integer representations), the GGUF file format for efficient model storage and loading, and highly optimized kernels for CPU inference using AVX, AVX2, AVX512, and ARM NEON instructions. On Apple Silicon, it runs fast through Metal framework integration. NVIDIA’s own engineers have contributed CUDA Graph optimizations that achieve approximately 150 tokens per second on an RTX 4090 for Llama 3 8B.
The deeper lesson is that memory hierarchy should be treated as a first-class design constraint. By carefully managing data layout, quantization, and cache utilization, llama.cpp shows that much of the “required” GPU hardware was compensating for inefficient software. A Llama 3 8B model that supposedly needs a data center GPU runs fine on a MacBook. Smarter software keeps beating bigger hardware; we are just watching it play out on a new stage.
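As a rough illustration of the block-wise quantization idea, here is a pure-Python sketch in the spirit of llama.cpp's Q8_0 format (32 weights per block, one scale each). The real GGUF kernels pack these blocks into SIMD-friendly layouts; this shows only the arithmetic:

```python
# Symmetric block-wise 8-bit quantization: each block of 32 weights is
# stored as int8 values plus a single scale, cutting FP16 storage roughly
# in half while bounding per-block error.

BLOCK = 32

def quantize_q8(weights):
    """Split weights into blocks; store int8 values plus one scale per block."""
    blocks = []
    for i in range(0, len(weights), BLOCK):
        chunk = weights[i:i + BLOCK]
        scale = max(abs(w) for w in chunk) / 127 or 1.0  # 1.0 for all-zero blocks
        q = [round(w / scale) for w in chunk]            # values in [-127, 127]
        blocks.append((scale, q))
    return blocks

def dequantize_q8(blocks):
    return [scale * q for scale, qs in blocks for q in qs]

weights = [0.013 * ((-1) ** i) * (i % 7) for i in range(64)]
restored = dequantize_q8(quantize_q8(weights))
err = max(abs(a - b) for a, b in zip(weights, restored))
print(f"max reconstruction error: {err:.5f}")  # small relative to weight scale
```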
B. vLLM and PagedAttention: Virtual Memory for AI
Developed at UC Berkeley’s Sky Computing Lab, vLLM introduced PagedAttention, an attention algorithm that applies the classical virtual memory and paging techniques from operating systems to KV cache management [4]. The project has accumulated over 50,000 GitHub stars and is now deployed in production at numerous organizations including major cloud providers.
The problem PagedAttention solves is KV cache fragmentation. During autoregressive generation, the key-value cache for each request grows dynamically and unpredictably. Traditional systems pre-allocate contiguous memory blocks, wasting 60%-80% of GPU memory to fragmentation and over-reservation. PagedAttention partitions the KV cache into fixed-size blocks that can be stored non-contiguously, managed via a block table analogous to a page table in an operating system.
The numbers back its efficiency: 2-4x throughput improvement over FasterTransformer and Orca with the same latency. For parallel sampling and beam search, memory sharing reduces overhead by up to 55%, translating to 2.2x throughput improvement. The system supports continuous batching that dynamically replaces completed sequences with new ones, maximizing GPU utilization. These gains come purely from better memory management. The same hardware, the same model, just smarter software.
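A toy version of the block-table idea, with the caveat that real vLLM manages GPU tensor blocks, copy-on-write sharing, and scheduling; this sketch only shows the logical-to-physical mapping that eliminates over-reservation:

```python
# PagedAttention-style bookkeeping: KV cache memory is a pool of
# fixed-size blocks, and each sequence holds a block table mapping
# logical token positions to physical blocks, like a page table.

BLOCK_SIZE = 16   # tokens per block

class BlockAllocator:
    def __init__(self, num_blocks):
        self.free = list(range(num_blocks))

    def alloc(self):
        return self.free.pop()   # any free block; blocks need not be contiguous

class Sequence:
    def __init__(self, allocator):
        self.allocator = allocator
        self.block_table = []    # logical block index -> physical block id
        self.length = 0

    def append_token(self):
        if self.length % BLOCK_SIZE == 0:   # current block full (or none yet)
            self.block_table.append(self.allocator.alloc())
        self.length += 1

pool = BlockAllocator(num_blocks=64)
seq = Sequence(pool)
for _ in range(40):                          # generate 40 tokens
    seq.append_token()

# 40 tokens occupy ceil(40/16) = 3 blocks, allocated only as needed.
print(len(seq.block_table), len(pool.free))  # -> 3 61
```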
C. ExLlamaV2: Mixed-Precision at the Layer Level
ExLlamaV2 takes a different approach to efficiency, using mixed-precision quantization that varies within a model [13]. The EXL2 format supports two, three, four, five, six, and eight-bit quantization, with the ability to mix precision levels not just between layers but within each linear layer. More important weights (those that contribute more to output accuracy) get more bits; less important weights get fewer.
The quantization process uses GPTQ-style optimization but with finer granularity. Parameter selection is automatic, based on measuring quantization error against calibration data for each possible setting. The result is models that achieve a target average bitrate (say, 4.0 bits per weight) while preserving accuracy better than uniform quantization. Benchmarks show ExLlamaV2 achieving 56+ tokens per second on a T4 GPU — faster than GPTQ, faster than llama.cpp’s GGUF format, with comparable or better accuracy.
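The bit-allocation idea can be sketched as a greedy budget problem. The per-layer error numbers below are invented for illustration, and the real EXL2 quantizer works at much finer granularity against calibration data:

```python
# Greedy mixed-precision allocation: start every layer at the minimum
# bitrate, then spend the remaining bit budget wherever it reduces
# measured quantization error the most.

# layer -> {bits: resulting error}; fewer bits, more error (invented numbers)
measured_error = {
    "attn.q": {2: 0.90, 4: 0.20, 6: 0.05},
    "attn.k": {2: 0.25, 4: 0.10, 6: 0.03},
    "mlp.up": {2: 1.50, 4: 0.30, 6: 0.02},
}

def allocate(target_avg_bits):
    bits = {layer: 2 for layer in measured_error}          # start at minimum
    budget = target_avg_bits * len(bits) - sum(bits.values())
    while budget >= 2:
        # upgrade whichever layer gains the most accuracy per 2 extra bits
        layer = max(bits, key=lambda l: measured_error[l][bits[l]]
                    - measured_error[l][bits[l] + 2] if bits[l] < 6 else -1)
        if bits[layer] >= 6:
            break                                          # everything maxed out
        bits[layer] += 2
        budget -= 2
    return bits

print(allocate(target_avg_bits=4.0))
# -> {'attn.q': 4, 'attn.k': 2, 'mlp.up': 6}: average 4.0 bits, but the
#    error-sensitive MLP layer gets 6 bits while a tolerant layer gets 2.
```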
The project now supports paged attention via Flash Attention 2.5.7+, includes a dynamic generator with smart prompt caching and K/V cache deduplication, and supports speculative decoding for additional speedups. ExLlamaV3 extends this with the EXL3 format, a streamlined variant of QTIP from Cornell that can convert a model in a single step with a fused Viterbi kernel.
D. MLX: Apple’s Unified Memory Advantage
Apple’s MLX framework exploits a hardware advantage that the NVIDIA-centric AI industry has largely ignored: unified memory architecture [12]. On Apple Silicon, the CPU, GPU, and Neural Engine share the same physical memory pool, eliminating the PCIe transfers that bottleneck discrete GPU systems. The framework has rapidly accumulated over 21,000 GitHub stars.
This matters for throughput. An M4 Max offers 128 GB of unified memory, a capacity in the range of high-end data center GPUs, with 546 GB/s of bandwidth accessible to any processor. Recent benchmarks from vllm-mlx show 21%-87% higher throughput than llama.cpp across models from 0.6B to 30B parameters on Apple Silicon. For multimodal workloads, content-based prefix caching achieves up to 28x speedup on repeated image queries and 24.7x on video analysis.
With the M5 chip’s Neural Accelerators, inference speeds improve another 4x in time-to-first-token on compute-bound operations. MLX supports on-the-fly quantization that can convert a 7B Mistral model to 4-bit in seconds. A researcher with a high-end MacBook can now run experiments that previously required cloud GPU allocation, and can do so with better energy efficiency than a data center setup.
E. Petals: BitTorrent for Language Models
Petals goes furthest, distributing the model across the internet [8]. Developed through the BigScience collaboration by researchers at University of Washington, Hugging Face, and ENS Paris-Saclay, Petals allows users to run Llama 2 70B or even Llama 3.1 405B by connecting to a swarm of volunteer GPUs that each host a subset of model layers.
Each server loads several consecutive transformer blocks, while a distributed hash table tracks which servers hold which layers. When a client sends a request, it is routed through a chain of servers chosen to minimize total forward-pass time. The system tolerates disconnections gracefully, automatically re-routing and re-balancing as participants join or leave the swarm.
Performance is usable. Llama 2 70B achieves six tokens per second while Falcon 180B reaches four tokens per second — fast enough for interactive chatbots. More importantly, Petals is 3-25x faster than CPU offloading for single-batch inference in realistic network conditions. The project proves that the constraint is data movement, not compute. Even internet latency beats the bandwidth constraints of moving 140 GB of weights through a PCIe bus repeatedly.
F. AirLLM: The 4GB Proof of Concept
AirLLM shows the extreme case: 70B parameter inference on 4 GB of GPU memory, 405B on 8 GB [11]. The approach is layer-wise loading — process one transformer layer at a time, loading weights from storage as needed, using HuggingFace Accelerate’s meta device feature to defer actual memory allocation.
The insight is that inference never needs simultaneous access to all layers, only sequential access, since a forward pass proceeds from the first layer to the last. At any moment, only one layer is actively computing, while the other 79 layers of a 70B model sit idle in memory. AirLLM trades that idle memory for active loading, using the safetensors format for memory-mapped weight loading that maximizes speed.
With prefetching (overlapping the loading of layer N+1 with the computation of layer N), block-wise quantization during transfer (2-4x bandwidth reduction), and NVMe SSD bandwidth of 7 GB/s, the latency penalty is manageable. Version 2.0 added compression support that provides up to 3x speed improvement with minimal accuracy loss. The project now supports CPU inference, macOS, and models including Llama 3.1 405B, all while running on hardware that NVIDIA would recommend 640 GB of GPU memory for.
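The prefetch pipeline described above can be sketched as follows; `load_layer` and `forward` are hypothetical stand-ins for the real safetensors loading and transformer-block computation:

```python
# AirLLM-style layer-wise inference: only one layer's weights are
# resident at a time, and loading layer i+1 overlaps with computing
# layer i via a single background I/O worker.

from concurrent.futures import ThreadPoolExecutor

NUM_LAYERS = 80   # e.g., Llama 2 70B

def load_layer(i):
    """Stand-in for memory-mapping one layer's weights from NVMe."""
    return {"layer": i}          # pretend these are the weights

def forward(weights, x):
    """Stand-in for one transformer block's forward pass."""
    return x + 1                 # pretend computation

def generate_step(x):
    with ThreadPoolExecutor(max_workers=1) as io:
        pending = io.submit(load_layer, 0)
        for i in range(NUM_LAYERS):
            weights = pending.result()            # wait for layer i's weights
            if i + 1 < NUM_LAYERS:                # prefetch: start loading i+1
                pending = io.submit(load_layer, i + 1)
            x = forward(weights, x)               # ...while computing layer i
    return x

print(generate_step(0))   # -> 80 (one increment per layer traversed)
```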
The Attention Paradox
Nowhere is the inefficiency more pronounced than in the transformer’s self-attention mechanism, the very innovation that made modern language models possible.
Self-attention computes pairwise interactions between all tokens in a sequence, allowing it to capture long-range dependencies that eluded earlier architectures like recurrent neural networks. But this capability comes with quadratic complexity: double the sequence length, quadruple the computation and memory.
For a 4,096-token context (modest by current standards, as models like Claude and GPT-4 support 100,000 tokens or more), the attention matrix contains over 16 million entries per attention head. A typical large language model has 32-128 attention heads across dozens of layers. The memory required to store these matrices, combined with the key-value caches that enable efficient autoregressive generation, quickly dominates the total memory footprint of inference.
Tri Dao’s FlashAttention paper, presented at NeurIPS 2022, revealed that standard attention implementations were catastrophically inefficient [1]. The problem was not the algorithm’s computational requirements; it was the memory access patterns. Standard implementations materialize the full N×N attention matrix in GPU high-bandwidth memory (HBM), requiring multiple round trips between the GPU’s compute units and its relatively slow main memory.
The GPU memory hierarchy makes this worse. The A100 GPU is impressive by any measure: 80 GB of HBM with 2 TB/s of bandwidth. But it also has 192 KB of on-chip SRAM per streaming multiprocessor, with bandwidth estimated around 19 TB/s. That is nearly a 10x difference. Standard attention implementations ignore this hierarchy entirely, treating all memory as equally expensive to access. FlashAttention restructures the computation to work in tiles that fit in fast on-chip SRAM, achieving up to 7.6x speedup on GPT-2 while using linear rather than quadratic memory.
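The tiling trick rests on the "online softmax": a softmax over a full row of attention scores can be computed one tile at a time with a running maximum and denominator, so the N×N matrix is never materialized. A pure-Python sketch of the numerics (real FlashAttention fuses this into per-tile CUDA kernels operating in SRAM):

```python
# One query row of attention, softmax(q·K^T)·V, computed tile by tile.
# The running max m, denominator, and accumulator are rescaled whenever
# a new tile raises the max, which keeps the softmax numerically stable.

import math

def attention_row_tiled(q, K, V, tile=4):
    m = float("-inf")                  # running max of scores seen so far
    denom = 0.0                        # running softmax denominator
    acc = [0.0] * len(V[0])            # running weighted sum of values
    for start in range(0, len(K), tile):
        k_tile, v_tile = K[start:start + tile], V[start:start + tile]
        scores = [sum(qi * ki for qi, ki in zip(q, k)) for k in k_tile]
        m_new = max(m, max(scores))
        scale = math.exp(m - m_new) if m != float("-inf") else 0.0
        denom *= scale                 # rescale old state to the new max
        acc = [a * scale for a in acc]
        for s, v in zip(scores, v_tile):
            w = math.exp(s - m_new)
            denom += w
            acc = [a + w * vi for a, vi in zip(acc, v)]
        m = m_new
    return [a / denom for a in acc]

# Check against the naive full-row softmax:
q = [0.5, -1.0]
K = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5], [-1.0, 2.0], [2.0, -1.0]]
V = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0], [9.0, 0.0]]
tiled = attention_row_tiled(q, K, V, tile=2)

scores = [sum(qi * ki for qi, ki in zip(q, k)) for k in K]
mx = max(scores)
ws = [math.exp(s - mx) for s in scores]
naive = [sum(w * v[d] for w, v in zip(ws, V)) / sum(ws) for d in range(2)]
assert all(abs(a - b) < 1e-9 for a, b in zip(tiled, naive))
print("tiled result matches naive softmax attention")
```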
This is Computer Architecture 101: cache-aware algorithms, tiled matrix multiplication, loop blocking. These are standard techniques in scientific computing that date back decades, yet it took until 2022 for someone to systematically apply them to the most compute-intensive operation in modern AI. FlashAttention-2 pushed utilization from 25%-40% to 50%-73% [7]. Still not optimal, but a major jump from paying attention to memory access patterns.
The Economics of Waste
Cloud providers charge $2-$4 per hour for an A100 GPU. A typical 70B model inference setup needs four to eight of them running in parallel. That is $8-$32 per hour before networking, storage, and overhead.
If layer-wise inference, quantization, and memory-efficient attention can deliver equivalent functionality on consumer hardware, we are not talking about incremental savings, but two orders of magnitude. An RTX 4090 costs about $1,600, which is 50-200 hours of A100 rental. For research, batch processing, dev/test, education — anywhere latency is not critical — the economics become a no-brainer.
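The break-even arithmetic, with the caveat that the rates above are rough list prices that vary by provider and region:

```python
# Break-even between renting an A100 cluster and buying one RTX 4090,
# using the ranges quoted in the text.

a100_rate = (2.0, 4.0)      # $/hr per A100
gpus_needed = (4, 8)        # typical 70B inference setup
cluster_rate = (a100_rate[0] * gpus_needed[0],   # cheapest: 4 GPUs at $2/hr
                a100_rate[1] * gpus_needed[1])   # priciest: 8 GPUs at $4/hr
print(f"cloud cluster: ${cluster_rate[0]:.0f}-${cluster_rate[1]:.0f}/hr")

rtx4090 = 1600              # one-time purchase, $
hours_low = rtx4090 / cluster_rate[1]    # vs the expensive cluster
hours_high = rtx4090 / cluster_rate[0]   # vs the cheap cluster
print(f"break-even: {hours_low:.0f}-{hours_high:.0f} hours of rental")
# -> break-even: 50-200 hours of rental
```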
Beyond cost savings, this changes who can participate in AI development. The current paradigm has created a world where meaningful AI capabilities require either massive capital investment or dependence on a handful of cloud providers. OpenAI spent an estimated $100 million training GPT-4. Academic researchers at smaller institutions, startups without venture backing, and developers in emerging markets are effectively locked out of working with state-of-the-art models.
Layer-wise inference and similar techniques could democratize access to large models in a way that reduced training costs alone cannot. Training a foundation model is a one-time cost; inference is ongoing. If running these models requires a fraction of the hardware currently assumed necessary, the barriers to entry drop dramatically. A university research lab could explore model behavior without cloud bills that exceed their equipment budgets. A startup could prototype with the same models that power enterprise applications.
The environmental angle is similar. Data center energy consumption for AI workloads is growing exponentially. If half the power consumed by GPUs is wasted on memory stalls, as the Microsoft study suggests [2], then the industry is drawing megawatts of electricity for compute units that spend most of their cycles waiting for data. Multiply this across the thousands of GPUs in a major AI data center, and the environmental cost of inefficiency adds up fast.
Why Did We Build It This Way?
If efficient inference was possible all along, why did everyone standardize on approaches that waste most of their compute?
Historical accident, misaligned incentives, and cultural blind spots. The usual suspects.
The historical accident is straightforward. The transformer architecture emerged from Google Brain in 2017, where computational resources were effectively unlimited. The original “Attention Is All You Need” paper explicitly optimized for parallelism and throughput at scale, not for memory efficiency. This made sense for Google’s training infrastructure — they had thousands of TPUs and GPUs at their disposal, and getting models trained faster was more valuable than reducing hardware requirements. But these design patterns were blindly replicated as the models were deployed more broadly, even in contexts where the original assumptions no longer held.
The misaligned incentives go deeper. NVIDIA has every reason to sell bigger, more expensive GPUs. If customers can run workloads on smaller hardware, revenue suffers. Yes, they have invested in CUDA and cuDNN optimizations, but those have focused on improving throughput on high-end hardware, not enabling deployment on consumer devices. Data center revenue grew 409% year-over-year last quarter. That number depends on customers believing they need expensive hardware. Connect the dots.
Cloud providers have the same problem. AWS, Google Cloud, and Azure make money when you use more resources, not fewer. They will give you convenience and integration, but they have zero motivation to help you minimize compute. Their entire business model is consumption.
The cultural blind spot matters more than people admit. Academic incentives reward novel architectures and state-of-the-art performance on benchmarks, not engineering optimizations that make existing systems cheaper to run. A paper showing 3% accuracy improvement on a language modeling benchmark is more publishable than one showing 3x cost reduction for equivalent performance. The prestigious venues (NeurIPS, ICML, ICLR) have traditionally favored algorithmic novelty over systems engineering.
There is also a knowledge gap. The deep learning community drew heavily from math and statistics backgrounds, where memory hierarchies and cache behavior are foreign concepts. Computer architecture, systems programming, performance engineering — different fields, different cultures, different publication venues. The people who understood memory optimization were not the same people building ML frameworks. It is like asking quantum physicists to optimize database queries.
The Road Ahead
These open-source projects are not production-ready solutions for all use cases. The latency trade-offs are real. Generating tokens takes longer when loading layers from disk or routing through internet servers. The throughput for high-volume applications will not match dedicated inference hardware running at full occupancy. Interactive applications requiring immediate responses may not tolerate the additional latency.
But these projects represent something more important than individual optimizations. They are proof that the constraints we have accepted are artificial. Once you demonstrate that a 70B model can run on 4 GB of memory, or that Llama 2 70B can be served by a swarm of volunteers over the internet, the conversation shifts from whether it is possible to how to make it practical for different use cases.
The broader research community is beginning to respond. Work on KV-cache compression shows promising results for reducing memory footprint during long-context generation. MiniCache demonstrated that cross-layer redundancy in attention states can be exploited for substantial memory savings. HeadInfer showed that head-wise offloading can extend context lengths significantly with minimal accuracy impact.
Mixture-of-experts architectures with dynamic routing can reduce active parameters per token while maintaining model capability. Attention-free language models like RWKV and Mamba offer linear-time alternatives to quadratic attention, with memory requirements that scale much more gracefully with sequence length.
These techniques all share a common thread. They treat memory as a precious resource to be optimized rather than an unlimited input to be maximized. This shift in perspective may ultimately matter more than any individual technique. When memory efficiency becomes a first-class design constraint, the entire space of possible architectures opens up.
Conclusion
AI infrastructure is hitting a wall. The path of adding more GPUs to solve scaling challenges is running into physical, economic, and environmental limits. NVIDIA’s data center revenue growth is sustained by capital investment that assumes current architectures are optimal. If layer-wise inference, quantization, and memory-efficient attention become mainstream, the demand curve shifts hard.
This pattern repeats throughout computing history. When resources are abundant, we solve problems by spending. When resources get constrained, or when someone actually measures utilization, we discover that smart engineering beats money. The engineers who optimized for mercury delay lines knew this. The database engineers who built B-trees knew this. The web developers who invented CDNs knew this.
AI is now mature enough that efficiency cannot be an afterthought. The easy scaling gains are done. Environmental costs are impossible to ignore. The concentration of capabilities among a few players is raising real concerns about competition and access. A reckoning is coming.
Seventy billion parameters on a 4 GB GPU. Llama 2 70B served by volunteers over the internet. A 93,000-star project running efficient inference on consumer hardware. These sound like magic tricks. They are not. They are carefully engineered solutions to a problem the industry — intoxicated by scaling laws and flush with venture capital — never bothered to solve. The magic trick was convincing everyone it was not possible. The real work is rebuilding infrastructure around what is actually necessary.
Afterword: A Personal Perspective on AI Infrastructure
I have spent years working at the intersection of technology and regulated industries, helping banks modernize infrastructure and advising government agencies on tech adoption. You develop a sensitivity to the gap between what vendors promise and what organizations actually need. The AI infrastructure market has an enormous gap.
When a major bank asks if they need an A100 cluster for their compliance AI assistant, the honest answer is usually “no.” They need reliable inference, reasonable latency, and strong security. None of that requires the hardware that hyperscaler marketing suggests. But the industry’s default recommendation is always maximum hardware investment. Funny how that works.
This pattern is not new. Every technology cycle produces its own “you need more hardware” narrative. What is different now is scale and speed. Organizations are making billion-dollar infrastructure decisions based on assumptions nobody has rigorously tested.
For practitioners: benchmark your actual workloads before accepting vendor recommendations. Test layer-wise inference, quantization, CPU offloading on your specific use cases. Figure out whether your bottleneck is compute or something more mundane like memory bandwidth. The results usually tell a different story than the sales pitch.
The $100 billion has already been spent. It is time to build what’s next.
- Sultan
This article was written by Sultan Meghji, CEO of Frontier Foundry and former Chief Innovation Officer at the FDIC.
References
[1] T. Dao, D. Y. Fu, S. Ermon, A. Rudra, and C. Ré, “FlashAttention: Fast and memory-efficient exact attention with IO-awareness,” in Proc. Adv. Neural Inf. Process. Syst. (NeurIPS), Stanford Univ., Stanford, CA, USA and Univ. at Buffalo, Buffalo, NY, USA, 2022.
[2] Y. Gao, Y. He, X. Li et al., “An empirical study on low GPU utilization of deep learning jobs,” in Proc. IEEE/ACM 46th Int. Conf. Softw. Eng. (ICSE), 2024, Distinguished Paper Award.
[3] S. Williams, A. Waterman, and D. Patterson, “Roofline: An insightful visual performance model for multicore architectures,” Commun. ACM, vol. 52, no. 4, pp. 65–76, 2009.
[4] W. Kwon, Z. Li, S. Zhuang et al., “Efficient memory management for large language model serving with PagedAttention,” in Proc. ACM SIGOPS 29th Symp. Operating Syst. Principles (SOSP), UC Berkeley, Berkeley, CA, USA, 2023.
[5] X. Jiang, Y. Zhou et al., “NEO: Saving GPU memory crisis with CPU offloading for online LLM inference,” in Proc. Mach. Learn. Syst. (MLSys), UC Berkeley, UC Davis, and Harvard, 2025.
[6] Y. Sheng, L. Zheng, B. Yuan et al., “FlexGen: High-throughput generative inference of large language models with a single GPU,” in Proc. Int. Conf. Mach. Learn. (ICML), Stanford, UC Berkeley, and ETH Zurich, 2023.
[7] T. Dao, “FlashAttention-2: Faster attention with better parallelism and work partitioning,” arXiv:2307.08691, Princeton Univ., Princeton, NJ, USA, 2023.
[8] A. Borzunov et al., “Petals: Collaborative inference and fine-tuning of large models,” in Proc. Assoc. Comput. Linguistics (ACL), Univ. of Washington, Hugging Face, and ENS Paris-Saclay, 2023.
[9] W. A. Wulf and S. A. McKee, “Hitting the memory wall: Implications of the obvious,” ACM SIGARCH Comput. Archit. News, vol. 23, no. 1, pp. 20–24, 1995.
[10] G. Gerganov et al., “llama.cpp: LLM inference in C/C++,” GitHub, 2023–2026, 93,000+ stars. [Online]. Available: https://github.com/ggml-org/llama.cpp
[11] G. Li, “AirLLM: Scaling large language models on low-end commodity computers,” GitHub, 2023. [Online]. Available: https://github.com/lyogavin/airllm
[12] A. Hannun et al., “MLX: Efficient and flexible machine learning on Apple silicon,” Apple Mach. Learn. Res., 2023, 21,000+ stars.
[13] ExLlamaV2 Contributors, “ExLlamaV2: A fast inference library for running LLMs locally,” GitHub, 2023–2026. [Online]. Available: https://github.com/turboderp-org/exllamav2