News & Analysis

AI Infrastructure Puts Companies in a Bind

Speedy adoption without adequate GenAI use cases is leaving companies with poor TCO

A recent survey indicated that IT leaders across the world are in a difficult position as they expand their AI infrastructure. Many complained that teams were moving far too quickly to build out AI capabilities without first prioritizing GenAI use cases within their organizations.

The report, prepared by the AI Infrastructure Alliance (AIIA) after surveying more than 1,000 global organizations in the AI and machine learning (ML) space, found that 96% of respondents were scaling up their AI computing infrastructure, capacity and investments during 2024.

Meanwhile, more than half of the respondents said they planned to use large language models such as LLaMA as part of this process, while around 26% planned to use embedded models in their commercial AI deployments. Respondents broadly agreed that the availability, cost and architecture design of AI infrastructure present major challenges.

Computing challenges around GPUs and efficiency

Most IT leaders who responded to the survey said they were aware of the computing challenges around AI models, such as GPU availability and overall cost. Their bigger concern, however, was deploying GenAI applications without first making sure the right business use cases had been prioritized.

The report also flagged another major concern among respondents: GenAI initiatives slowing down for a variety of reasons, including limited execution capability and ambiguity among leadership over prioritization. Several IT leaders confirmed they felt caught between the pressure to innovate and the fallout that could follow from errors in planning.

Benchmarking solutions can help, but AI champions need clarity

One way companies are looking to resolve this challenge is by seeking clarity through peer reviews and industry benchmarks of the AI platforms currently on the market. Even with those resources available, however, IT leaders still need to identify their top-priority challenges and develop GenAI use cases around them.

While evaluating AI infrastructure around business use cases, IT leaders and their AI champions must also account for total cost of ownership (TCO), which spans computing, scheduling, latency and energy efficiency. Only in such scenarios, the AIIA says, can IT leaders confidently and accurately forecast the TCO of GenAI in their enterprises.
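As a rough illustration of how those components add up, the sketch below combines compute, scheduling overhead, energy and latency-related costs into a single monthly estimate. The helper function and all figures are hypothetical placeholders for illustration, not numbers taken from the AIIA report:

```python
# Illustrative only: a back-of-the-envelope GenAI TCO estimate.
# All figures below are hypothetical placeholders, not AIIA survey data.

def estimate_monthly_tco(
    gpu_hours: float,                  # GPU-hours consumed per month (compute)
    gpu_hourly_rate: float,            # blended $/GPU-hour (cloud or amortized on-prem)
    scheduler_overhead: float,         # fraction of GPU-hours lost to queueing/scheduling gaps
    energy_kwh_per_gpu_hour: float,    # energy drawn per GPU-hour
    energy_rate: float,                # $/kWh
    latency_sla_penalty: float = 0.0,  # $ penalties or credits tied to latency SLAs
) -> float:
    """Sum the main TCO components: compute, scheduling waste, energy, latency costs."""
    effective_gpu_hours = gpu_hours * (1 + scheduler_overhead)
    compute_cost = effective_gpu_hours * gpu_hourly_rate
    energy_cost = effective_gpu_hours * energy_kwh_per_gpu_hour * energy_rate
    return compute_cost + energy_cost + latency_sla_penalty


if __name__ == "__main__":
    # Hypothetical workload: 10,000 GPU-hours/month at $2.50/GPU-hour,
    # 15% scheduling overhead, 0.7 kWh per GPU-hour at $0.12/kWh.
    tco = estimate_monthly_tco(10_000, 2.50, 0.15, 0.7, 0.12)
    print(f"Estimated monthly GenAI TCO: ${tco:,.0f}")
```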

The focus must be on total cost of ownership (TCO)

The survey notes a meaningful shift in AI hardware usage and a growing demand for cost-efficient computing for AI-led insights. More than half of the respondents said they were actively seeking cost-effective alternatives to address GPU scarcity, while 27% of the leaders said they were looking for more affordable GPUs for AI training.

Industry experts say that the performance, cost-effectiveness and optimal utilization of GPUs will be critical to the success of GenAI initiatives during 2024. In fact, one of the biggest challenges on this front is idle GPU capacity, which drives up costs.

While 78% of respondents said they were using more than 50% of their total GPU resources during peak periods of activity, barely 7% said their GPU infrastructure exceeds 85% utilization during such periods. That means a crucial requirement for keeping GenAI TCO in check is to better manage existing computing resources and to expand infrastructure with alternatives.
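To see why those utilization figures matter for TCO, consider the rough calculation below. The hourly rate is a hypothetical placeholder, but the arithmetic shows how idle capacity inflates the effective cost of every useful GPU-hour:

```python
# Illustrative only: how idle GPU capacity inflates the effective cost of useful work.
# The utilization levels echo the survey's broad picture; the price is hypothetical.

def cost_per_useful_gpu_hour(hourly_rate: float, utilization: float) -> float:
    """Cost of one hour of useful GPU work when the fleet sits partly idle."""
    return hourly_rate / utilization


if __name__ == "__main__":
    hourly_rate = 2.50  # hypothetical blended $/GPU-hour
    for utilization in (0.50, 0.85):
        print(f"{utilization:.0%} utilization -> "
              f"${cost_per_useful_gpu_hour(hourly_rate, utilization):.2f} per useful GPU-hour")
```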