
# Tracking down a 25% Regression on LLVM RISC-V

April 9, 2026 · Kavin Gnanapandithan

Similar to the previous post, this post covers my analysis of a benchmark on RISC-V targets. Unlike the previous post, I was able to land a patch to eliminate the performance gap to GCC (for this benchmark)!

TLDR

A recent LLVM commit ↗ improved isKnownExactCastIntToFP so that InstCombine can fold fpext(sitofp x to float) to double into a direct uitofp x to double cast. This inadvertently broke a downstream narrowing optimization in visitFPTrunc, which relied on that fpext to narrow a double operation to float. The result was a ~24% performance regression on RISC-V targets, where fdiv.d (33-cycle latency) was emitted instead of fdiv.s (19-cycle latency).

My fix ↗ extends getMinimumFPType with range analysis to recognize that fptrunc(uitofp x to double) to float can be reduced to uitofp x to float, restoring the narrowing optimization.

## Analysis

I was looking at Igalia’s site comparing the performance of LLVM to GCC on RISC-V targets, and I noticed this particular benchmark ↗.

As shown in the image below, LLVM requires about 8% more cycles than GCC for that specific benchmark on the SiFive P550 CPU.

I have included snippets of the relevant basic block assembly. Practically all the cycles were spent on the assembly below.

### LLVM

### GCC

From the two assembly listings, it wasn’t immediately obvious to me why GCC was doing better. They seemed almost identical, and if anything, LLVM was able to optimize the branch logic here. The big difference I did notice was that LLVM was doing an fdiv.d, a double-precision (f64) floating-point division.

This seemed promising, so I decided to run llvm-mca on the generated assembly to get a better idea of what was happening. Note that I was using an LLVM build from source that was a couple of days behind upstream. From the analysis done by llvm-mca, I noticed that the fdiv.d instruction made no appearance in the loop. Although I did not show it above, both GCC and LLVM contained an fdiv.d instruction in a later basic block, but this was outside the main loop and therefore not relevant to the performance difference.

Info

llvm-mca is a tool within the LLVM suite that can be used to statically estimate the performance of machine code for a specific CPU.

text

$LLVMBUILDDIR/bin/llvm-mca -mtriple=riscv64 -mcpu=sifive-p550 pi.s

Where I expected the fdiv.d, there were two fdiv.s instructions, just like in GCC's output. From this, I concluded that the fdiv.d instruction in the loop must be a recent regression: LLVM used to narrow the double division to a float division, but the latest builds no longer do.

I confirmed this was indeed a regression by comparing LLVM to a prior build for the same CPU and benchmark.

https://cc-perf.igalia.com/dbdefault/v4/nts/profile/260/406/4 ↗

Below is the assembly generated by the prior build of LLVM.

No fdiv.d instruction on the prior build. In its place, an fdiv.s is used.

## Out-of-Order Execution

If you are wondering why the ordering of the assembly above looks different from the assembly of the new LLVM build, it’s because the target CPU, the SiFive P550, is an out-of-order CPU. Unlike the in-order CPU in the Banana Pi mentioned in the previous post, this target can execute instructions in an order that produces higher throughput.

I’m not exactly sure why the fdiv.d is placed higher, but my suspicion is that since the double division has a significantly higher latency, the compiler's instruction scheduler, which models this CPU, is trying to place other instructions after it to ‘hide’ this latency. The fcvt.s.d instruction that consumes the result of the fdiv.d, ft1, would need to wait for 33 cycles, so perhaps the scheduler decided to put independent instructions between them.

## Where is this happening?

At this point we don’t know why this happened, but a step in the right direction would be to figure out where it happens. Perhaps it was some change in the RISC-V backend? The command below gives us the commits related to RISC-V.

text

git log --after="2026-04-01 00:11" --before="2026-04-04 00:10" | grep -E "RISC"

text

[RISCV] Select add(vec, splat(scalar)) to PADDS for P extension (#190303)
[RISCV] Allow coalesceVSETVLIs to move an LI if it allows a vsetvli to be mutated. (#190287)
[RISCV][TTI] Update cost and prevent exceed m8 for vector.extract.last.active (#188160)
[RISCV] Check EnsureWholeVectorRegisterMoveValidVTYPE in RISCVInsertVSETVLI::transferBefore. (#190022)
[RISCV] Remove codegen for vpctlz, vpcttz, vpctpop (#189904)
Part of the work to remove trivial VP intrinsics from the RISC-V
[RISCV] Move unpaired instruction back in RISCVLoadStoreOptimizer (#189912)
RISCVLoadStoreOptimizer moves the instruction adjacent to the other
[RISCV] Fix stackmap shadow trimming NOP size for compressed targets (#189774)
[RISCV] Relax VL constraint in convertSameMaskVMergeToVMv (#189797)
[RISCV] Add SATIRV64/USATIRV64 to RISCVOptWInstrs. (#190030)
and the RISCVISD::SATI encoding uses the type width minus one.
[RISCV][MCA] Update sifive-p670 tests to consume input files instead (#189785)
[RISCV] Remove codegen for vpminnum, vpmaxnum (#189899)
Part of the work to remove trivial VP intrinsics from the RISC-V
[RISCV] Add RISCVISD::USATI/SATI to computeKnownBitsForTargetNode/ComputeNumSignBitsForTargetNode. (#189702)
[RISCV] Add assertions to VSETVLIInfo::hasSEWLMULRatioOnly(). NFC (#189799)
[RISCV] combine-isfpclass.ll - add initial tests showing failure to constant fold ISD::ISFPCLASS nodes (#189940)
[RISCV] Remove codegen for VP float rounding intrinsics (#189896)
Part of the work to remove trivial VP intrinsics from the RISC-V
[RISCV] Remove codegen for vplrint, vpllrint (#189714)
Part of the work to remove trivial VP intrinsics from the RISC-V
[RISCV] Add codegen support for SATI and USATI. (#189532)

None of these commits immediately stood out to me, so I set them aside. Turning to the middle-end, I decided to look at the final LLVM IR produced by the optimizer in my local LLVM build, which was a few days old. Remember that my build is the ‘working’ build. The commands below produce the LLVM IR at the beginning of the pipeline (right after the Clang front-end) and at the end.

Gets the LLVM IR before the optimization pipeline runs:

text

$LLVMBUILDDIR/bin/clang -O3 \
--target=riscv64-unknown-linux-gnu \
-march=rv64gc_zba_zbb \
--sysroot=/usr/riscv64-linux-gnu \
-Xclang -disable-llvm-passes \
-S -emit-llvm pi.c -o pirawold.ll

Gets the LLVM IR at the end of the optimization pipeline:

text

$LLVMBUILDDIR/bin/clang -O3 \
  --target=riscv64-unknown-linux-gnu \
  -march=rv64gc_zba_zbb \
  --sysroot=/usr/riscv64-linux-gnu \
  -S -emit-llvm pi.c -o pi.ll

text

graph LR
    %% Main Compiler Flow
    Source[Source Code<br/><b>pi.c</b>] --> FE[<b>Front-End</b><br/><i>Clang/Lexer/Parser</i>]
    
    subgraph MiddleEnd ["<b>Middle-End</b> (Optimization Pipeline)"]
        direction LR
        MEStart(<b>Beginning of Middle-End</b><br/><i>Unoptimized IR</i>) --> Opts[<i>InstCombine, LoopUnroll, GVN, etc.</i>]
        Opts --> MEEnd(<b>End of Middle-End</b><br/><i>Optimized IR</i>)
    end
    
    FE --> MEStart
    MEEnd --> BE[<b>Backend</b><br/><i>Code Generator/Instruction Selection</i>]
    BE --> Asm(<b>Assembly</b><br/><i>RISC-V</i>)
    %% Styling Main Pipeline
    style Source fill:none,stroke:#888,stroke-width:1px
    style FE fill:none,stroke:#888,stroke-width:1px
    style MEStart fill:none,stroke:#1e88e5,stroke-width:2px
    style MEEnd fill:none,stroke:#1e88e5,stroke-width:2px
    style Opts fill:none,stroke:#888,stroke-width:1px,stroke-dasharray: 5 5
    style BE fill:none,stroke:#888,stroke-width:1px
    style Asm fill:none,stroke:#888,stroke-width:1px
    style MiddleEnd fill:none,stroke:#444,stroke-width:1px,stroke-dasharray: 3 3
    %% Compact Command Boxes
    %% We use <div> and <code> to keep things tight and prevent block-level gaps
    CmdRaw["<div style='line-height:1.2;'><b>Command 1: Raw IR</b><br/><code style='background:none;color:inherit;'>-Xclang -disable-llvm-passes</code><br/><small>Produces <b>pirawold.ll</b></small></div>"]
    
    CmdOpt["<div style='line-height:1.2;'><b>Command 2: Optimized IR</b><br/><code style='background:none;color:inherit;'>-O3 -S -emit-llvm</code><br/><small>Produces <b>pi.ll</b></small></div>"]
    %% Connections
    CmdRaw -.-> MEStart
    CmdOpt -.-> MEEnd
    %% Style Command Notes
    style CmdRaw fill:#fff3e011,stroke:#f57c00,stroke-width:1px,rx:10,ry:10
    style CmdOpt fill:#fff3e011,stroke:#f57c00,stroke-width:1px,rx:10,ry:10

If you are confused as to what I am doing, hopefully the diagram above illustrates it better. If the middle-end is responsible for narrowing the double to a float, comparing the IR at both ends of the pipeline gives me an idea of which optimizations cause this.

Instead of doing this, you could also use the -print-before and -print-after options on a specific pass to see what that pass does to the IR.

This is the relevant snippet of the IR from the beginning of the optimization pipeline.

text

%conv  = sitofp i64 %5 to float        ; int -> float
%conv2 = fpext float %conv to double   ; float -> double 
%div3  = fdiv double %conv2, 7.438300e+04  ; fdiv.d 
%conv4 = fptrunc double %div3 to float ; double -> float

Which was converted into the following by the end of the pipeline.

text

%conv  = uitofp nneg i64 %0 to float   ; int -> float 
%conv4 = fdiv float %conv, 7.438300e+04 ; fdiv.s

By the end of the middle-end, the double division has been narrowed to a float division, so it is the middle-end that is responsible for this narrowing.

If you’re confused as to why these seemingly redundant cast operations are produced in the first place, taking a cursory glance at the source code can help.

Source ↗

text

int main(int argc, char *argv[]) {
  float ztot, yran, ymult, ymod, x, y, z, pi, prod;
  long int low, ixran, itot, j, iprod;
  ...
  for(j=1; j<=itot; j++) {
    iprod = 27611 * ixran;
    ixran = iprod - 74383 * (long int)(iprod/74383);
    x = (float)ixran / 74383.0;
    ...
  }

  ...
}

In this specific line, 74383.0 is a double literal in the source code, even though its value fits exactly in a float.

text

x = (float)ixran / 74383.0;

This is why the LLVM IR has an fpext converting a float to a double, and then a fptrunc of the double back to a float.

This may be obvious, but I’ll state it regardless. If the literal above had initially been a float, like below, the fdiv.d would never have been produced.

text

x = (float)ixran / 74383.0f;

Given a sufficient optimization level, the compiler should still be able to catch something like this, but it’s still cool to see how such a small change in code can make such a huge difference. In this case, over 19% more cycles!
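As a rough sanity check of why the narrowing is legal here, the sketch below compares the two spellings of the division for every value ixran can take in the benchmark. The helper names are hypothetical (only the two expressions come from pi.c), and it assumes IEEE-754 float and double arithmetic without excess intermediate precision, as on typical x86-64 or RISC-V toolchains.

```c
#include <assert.h>

/* Division as written in the benchmark: the double literal promotes the
   float operand to double, and the result is converted back to float. */
static float div_via_double(long ixran) {
    return (float)ixran / 74383.0;
}

/* Same division with a float literal: everything stays in single precision. */
static float div_via_float(long ixran) {
    return (float)ixran / 74383.0f;
}

/* Counts the inputs in [0, limit) for which the two spellings disagree. */
static long count_mismatches(long limit) {
    /* Inputs up to 2^24 convert to float without rounding. */
    assert(limit >= 0 && limit <= 16777216L);
    long mismatches = 0;
    for (long ixran = 0; ixran < limit; ++ixran) {
        if (div_via_double(ixran) != div_via_float(ixran))
            ++mismatches;
    }
    return mismatches;
}
```

Under those assumptions, count_mismatches(74383) should find no disagreement, because double carries more than enough precision for a float division rounded once to match the result rounded twice.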

The table below shows the mapping of LLVM IR to the C source code, as well as a brief note on what the specified IR does.

| LLVM IR | C Source | Notes |
| --- | --- | --- |
| %mul = mul nuw nsw i64 %ixran.053, 27611 | iprod = 27611 * ixran | integer multiply |
| %0 = urem i64 %mul, 74383 | ixran = iprod - 74383 * (long int)(iprod/74383) | compiler optimized modulo via urem |
| %conv2 = uitofp nneg i64 %0 to double | (float)ixran | cast to double first due to 74383.0 being a double literal |
| %div3 = fdiv double %conv2, 7.438300e+04 | / 74383.0 | division in double precision because 74383.0 is a double literal in C |
| %conv4 = fptrunc double %div3 to float | x = (float)... | explicit (float) cast truncates result back to float |
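The urem row may look surprising, since the C source never uses the % operator. A minimal sketch of the equivalence follows; the function names are made up for illustration, and the input range is restricted to the values ixran can take after the first iteration.

```c
#include <assert.h>

/* One step of the benchmark's generator, spelled the way pi.c spells it. */
static long next_ixran_source(long ixran) {
    long iprod = 27611 * ixran;
    return iprod - 74383 * (long)(iprod / 74383);
}

/* The same step written with the remainder operator, which is what the
   urem in the optimized IR computes for non-negative values. */
static long next_ixran_rem(long ixran) {
    /* Keeps the product non-negative and small enough for a 32-bit long. */
    assert(ixran >= 0 && ixran < 74383);
    return (27611 * ixran) % 74383;
}
```

Because C integer division truncates, a - b * (a / b) is by definition a % b, so the optimizer is free to collapse the three operations into a single remainder.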

The LLVM build at the time was producing the following.

text

%mul = mul nuw nsw i64 %ixran.053, 27611
%0 = urem i64 %mul, 74383
%conv2 = uitofp nneg i64 %0 to double
%div3 = fdiv double %conv2, 7.438300e+04
%conv4 = fptrunc double %div3 to float

The operands of %div3 are %conv2 and a floating-point constant. %conv2 is the result of a cast operation converting %0 into a double, but %0 is the result of a urem by 74383, so its maximum value is 74382, which fits exactly within a float.
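To make the "fits within a float" claim concrete, here is a small sketch of a range-based check. It is my own illustration rather than LLVM's implementation: the function names are hypothetical, and the rule it applies (bits needed by the largest value versus bits in the significand) is a conservative simplification.

```c
#include <assert.h>
#include <float.h>
#include <stdbool.h>
#include <stdint.h>

/* Number of bits needed to represent max_value (0 needs 0 bits). */
static int bits_needed(uint64_t max_value) {
    int bits = 0;
    while (max_value != 0) {
        ++bits;
        max_value >>= 1;
    }
    return bits;
}

/* Conservative check: every integer in [0, max_value] converts exactly if
   the largest one needs no more bits than the FP type's significand has. */
static bool range_fits_exactly(uint64_t max_value, int significand_bits) {
    assert(significand_bits > 0);
    return bits_needed(max_value) <= significand_bits;
}
```

With a maximum of 74382, the value needs 17 bits, comfortably below the 24-bit significand of an IEEE-754 float (FLT_MANT_DIG), so range_fits_exactly(74382, FLT_MANT_DIG) holds.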

Looking at the result from llvm-mca, we can see the following for fdiv.d.

text

1      33    32.00                       fdiv.d	ft1, ft1, fa3

This shows that fdiv.d has a latency of 33 cycles, significantly longer than the 19 cycles of fdiv.s. In our performance comparison, llvm-mca reported a Reciprocal Throughput (RThroughput) of 86.0 for the older LLVM build, whereas the newer build comes in at 100.0. RThroughput represents the number of clock cycles the processor must wait before it can start executing another instruction of the same type, so lower is better.

This shows how a double division is pretty expensive compared to a float division.

From the list of commits in those past few days, the one below immediately stood out to me.

text

[InstCombine] Use ComputeNumSignBits in isKnownExactCastIntToFP (#190235)
For signed int-to-FP casts, ComputeNumSignBits can prove exactness where
computeKnownBits cannot -- e.g. through ashr(shl x, a), b where sign propagation is
tracked precisely but individual known bits are all unknown.

InstCombine is an LLVM middle-end optimization pass that combines adjacent or related instructions into single, more efficient operations. It’s broad in its duties, so examples of InstCombine optimizations can vary. For example, InstCombine can reduce x = x * 2 to x = x << 1.
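The multiply-to-shift example can be written out in C as a tiny sketch. The function names are hypothetical, and unsigned arithmetic is used so that both forms wrap around in the same way.

```c
#include <assert.h>
#include <stdint.h>

/* The form a programmer would typically write. */
static uint32_t double_with_mul(uint32_t x) { return x * 2u; }

/* The form the optimizer may rewrite it to. */
static uint32_t double_with_shift(uint32_t x) { return x << 1; }

/* Aborts if the two forms ever disagree for the given input. */
static void check_doubling(uint32_t x) {
    assert(double_with_mul(x) == double_with_shift(x));
}
```

Calling check_doubling with values such as 0, 21, or UINT32_MAX exercises the ordinary case as well as the wrap-around case.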

Given that the newest build can no longer cast that integer to a float, this commit message with the int-to-FP casts (sitofp/uitofp) immediately roused my suspicions, and they were confirmed when I built LLVM before and after that commit. Before this change, everything worked and afterward, fdiv.d instructions appeared in the assembly.

I should also note that this patch is an improvement - the InstCombiner pass has more information available to it - but sometimes improvements can cause regression(s) elsewhere in unforeseeable ways. I would posit this gives opportunities for folks like me to contribute 😅

## Why is this happening?

Looking at the diff of this commit, I can see that the author of the patch added sign-bit analysis (ComputeNumSignBits) to the function isKnownExactCastIntToFP.

text

static bool isKnownExactCastIntToFP(CastInst &I, InstCombinerImpl &IC) {
  
  ...
  // For sitofp, the sign maps to the FP sign bit, so only magnitude bits
  // (BitWidth - NumSignBits) consume mantissa.
  if (IsSigned) {
    SigBits =
        (int)SrcTy->getScalarSizeInBits() - IC.ComputeNumSignBits(Src, &I);
    if (SigBits <= DestNumSigBits)
      return true;
  }
  return false;
}

You don’t need an in-depth understanding of what this change did, but we can infer that it causes isKnownExactCastIntToFP to return true in cases where it returned false before. So I decided to look at all the call sites of isKnownExactCastIntToFP to gain a better understanding of how this could lead to a regression.
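For intuition about the sign-bit argument, the sketch below applies the same inequality to a single concrete 64-bit value. This is only an illustration with hypothetical names: LLVM's ComputeNumSignBits is a static analysis that gives a lower bound valid for every possible runtime value, whereas this code simply counts the bits of one number.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Counts how many leading bits of v are copies of its sign bit,
   including the sign bit itself. */
static int num_sign_bits(int64_t v) {
    uint64_t u = (uint64_t)v;
    uint64_t sign = u >> 63;
    int n = 1;
    for (int bit = 62; bit >= 0 && ((u >> bit) & 1u) == sign; --bit)
        ++n;
    return n;
}

/* Same shape as the check in the snippet: the bits that are not sign
   copies must fit in the destination significand. */
static bool exact_by_sign_bits(int64_t v, int dest_sig_bits) {
    assert(dest_sig_bits > 0);
    int sig_bits = 64 - num_sign_bits(v);
    return sig_bits <= dest_sig_bits;
}
```

For example, 74382 has 47 sign bits as an i64, leaving 17 magnitude bits, which is within the 24-bit significand of a float. A value such as 16777217 (2^24 + 1) leaves 25 magnitude bits and is rejected.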

The following function calls isKnownExactCastIntToFP. It uses the result to fold away an fpext whose operand is an integer-to-FP cast. For example, a sitofp/uitofp from i64 x to float followed by an fpext of that float to double can be reduced to a single sitofp/uitofp from i64 x to double, provided the cast is known to be exact.

text

Instruction *InstCombinerImpl::visitFPExt(CastInst &FPExt) {
  // If the source operand is a cast from integer to FP and known exact, then
  // cast the integer operand directly to the destination type.
  Type *Ty = FPExt.getType();
  Value *Src = FPExt.getOperand(0);
  if (isa<SIToFPInst>(Src) || isa<UIToFPInst>(Src)) {
    auto *FPCast = cast<CastInst>(Src);
    if (isKnownExactCastIntToFP(*FPCast))
      return CastInst::Create(FPCast->getOpcode(), FPCast->getOperand(0), Ty);
  }
  return commonCastTransforms(FPExt);
}

This ties everything together.

Recall the LLVM IR that was generated at the beginning of the pipeline.

text

%conv  = sitofp i64 %5 to float        ; int -> float
%conv2 = fpext float %conv to double   ; float -> double 
%div3  = fdiv double %conv2, 7.438300e+04  ; fdiv.d 
%conv4 = fptrunc double %div3 to float ; double -> float

And the LLVM IR at the end of the pipeline.

text

%conv2 = uitofp nneg i64 %0 to double
%div3 = fdiv double %conv2, 7.438300e+04
%conv4 = fptrunc double %div3 to float

text

graph TD
    %% Node Definitions
    N1["%conv = sitofp i64 %5 to float<br/><i>int -> float</i>"]
    N2["%conv2 = fpext float %conv to double<br/><i>float -> double</i>"]
    N3["%div3 = fdiv double %conv2, 7.438300e+04<br/><i>fdiv.d</i>"]
    N4["%conv4 = fptrunc double %div3 to float<br/><i>double -> float</i>"]
    %% Data Flow
    N1 --> N2
    N2 --> N3
    N3 --> N4
    %% Subgraph
    subgraph CastGroup ["Initial Casts"]
        direction TB
        N1
        N2
    end
    %% Annotation Node
    AnnotateNode["<b>InstCombinerCast</b><br/>Reduces this pattern to %conv2 = uitofp nneg i64 %0 to double"]
    
    %% Styling - Using neutral strokes that work in both modes
    style AnnotateNode fill:#0288d122,stroke:#0288d1,stroke-width:2px,rx:10,ry:10
    style CastGroup fill:none,stroke:#888,stroke-dasharray: 5 5
    %% Invisible edges for layout
    AnnotateNode -.-> N1
    AnnotateNode -.-> N2

After that patch, the InstCombiner pass is able to optimize the sitofp i64 to float followed by fpext float to double, reducing the pair to a single uitofp nneg i64 %0 to double.

However, visitFPTrunc optimizes the following pattern, as described by this comment in the code:

> // If we have fptrunc(OpI (fpextend x), (fpextend y)), we would like to
> // simplify this expression to avoid one or more of the trunc/extend
> // operations if we can do so without changing the numerical results.
> //
> // The exact manner in which the widths of the operands interact to limit
> // what we can and cannot do safely varies from operation to operation, and
> // is explained below in the various case statements.

The optimized LLVM IR now no longer has that fpext instruction, so the InstCombiner pass and specifically visitFPTrunc can no longer narrow the double to a float.

## Landing a solution

So we diagnosed the issue: the recent patch to InstCombine, which improved the logic, prevents the pass from performing another optimization. We need to teach the InstCombiner to narrow the operation at the uitofp/sitofp itself when there is an fptrunc later anyway.

text

graph TD
    %% Define the Instruction Spine
    subgraph Flow ["LLVM Data Flow"]
        direction TB
        N0["%0 = urem i64 %mul, 74383<br/><i>Range Provider: 0 to 74,382</i>"]
        N1["%conv2 = uitofp nneg i64 %0 to double<br/><i>Integer to Double</i>"]
        N2["%div3 = fdiv double %conv2, 7.438300e+04<br/><i>Double Division</i>"]
        N3["%conv4 = fptrunc double %div3 to float<br/><i>Double to Float</i>"]
        %% Define the vertical chain
        N0 --> N1
        N1 --> N2
        N2 --> N3
    end
    %% Define the Optimizer Node on the side
    IC["<b>InstCombiner</b><br/>Reduces to float-precision<br/>math if inputs fit."]
    
    %% Styling: Using #AARRGGBB or #RRGGBBAA logic for Hextra
    %% fill:#1b5e2022 is a very transparent green (approx 13% opacity)
    style IC fill:#1b5e2022,stroke:#4caf50,stroke-width:2px,rx:10,ry:10
    
    %% Subgraph styling to keep it neutral
    style Flow fill:none,stroke:#888888,stroke-dasharray: 5 5
    %% Dotted arrow connections
    IC -.-> N0
    IC -.-> N1
    IC -.-> N3

I first raised an issue on GitHub. While I was confident that this was a valid issue, I wanted to double check that it warranted a fix. I do want to thank the author of the patch I mentioned earlier, @SavchenkoValeriy, as he offered me guidance on solving this issue and reviewed my PR. My initial solution would’ve been pretty convoluted, but he was able to offer me a much simpler approach.

GitHub Issue: https://github.com/llvm/llvm-project/issues/190503 ↗

PR: https://github.com/llvm/llvm-project/pull/190550 ↗

As pointed out by my reviewer, the issue is that isKnownExactCastIntToFP currently only checks the cast instruction (CastInst) it is given, against that instruction's own destination type.

text

%0 = urem i64 %mul, 74383
%conv2 = uitofp nneg i64 %0 to double
%div3 = fdiv double %conv2, 7.438300e+04
%conv4 = fptrunc double %div3 to float

The code below is the visitFPTrunc function. You can see the switch statement over the different operations, including FDiv, which corresponds to the fdiv.d/fdiv.s. FPT corresponds to the fptrunc instruction and BO corresponds to FPT.getOperand(0), so BO refers to %div3 and BO->getOperand(0) refers to %conv2. We want this uitofp to produce a float instead, so we need to modify getMinimumFPType to check for integer-to-FP cast operations as well as fpext instructions.

text

Instruction *InstCombinerImpl::visitFPTrunc(FPTruncInst &FPT) {
  if (Instruction *I = commonCastTransforms(FPT))
    return I;

  ...
  Type *Ty = FPT.getType();
  auto *BO = dyn_cast<BinaryOperator>(FPT.getOperand(0));
  if (BO && BO->hasOneUse()) {
    Type *LHSMinType = getMinimumFPType(BO->getOperand(0), PreferBFloat);
    Type *RHSMinType = getMinimumFPType(BO->getOperand(1), PreferBFloat);
    switch (BO->getOpcode()) {
      default: break;
      case Instruction::FAdd:
      case Instruction::FSub:
      ...
  ...
}

One of my initial ideas was to modify isKnownExactCastIntToFP to accept a parameter with a different Type (f32 in my case) but with a default nullptr value. This would allow us to modify just the header definition and its implementation. My reviewer instead proposed making a variant of isKnownExactCastIntToFP, canBeCastedExactlyIntToFP. This does the actual analysis with the type given to it, and isKnownExactCastIntToFP can call it. I mention this to show that it’s good to interact with the community and seek ideas from others. They can come up with ideas which, for a variety of reasons, could be better.

Below is the final git diff. We split isKnownExactCastIntToFP and created canBeCastedExactlyIntToFP to perform the actual analysis, and isKnownExactCastIntToFP calls it. Then, we have getMinimumFPType call canBeCastedExactlyIntToFP.

text

index bc52bf1168d4..2688891c1509 100644
--- a/llvm/include/llvm/Transforms/InstCombine/InstCombiner.h
+++ b/llvm/include/llvm/Transforms/InstCombine/InstCombiner.h
@@ -481,6 +481,8 @@ public:
   /// Return true if the cast from integer to FP can be proven to be exact
   /// for all possible inputs (the conversion does not lose any precision).
   bool isKnownExactCastIntToFP(CastInst &I) const;
+  bool canBeCastedExactlyIntToFP(Value *V, Type *FPTy, bool IsSigned,
+                                 const Instruction *CxtI = nullptr) const;
   OverflowResult computeOverflowForUnsignedMul(const Value *LHS,
                                                const Value *RHS,
diff --git a/llvm/lib/Transforms/InstCombine/InstCombineCasts.cpp b/llvm/lib/Transforms/InstCombine/InstCombineCasts.cpp
index e3c39a3c193e..0cd035bd1413 100644
--- a/llvm/lib/Transforms/InstCombine/InstCombineCasts.cpp
+++ b/llvm/lib/Transforms/InstCombine/InstCombineCasts.cpp
@@ -2039,10 +2039,17 @@ static Type *shrinkFPConstantVector(Value *V, bool PreferBFloat) {
 }
 /// Find the minimum FP type we can safely truncate to.
-static Type *getMinimumFPType(Value *V, bool PreferBFloat) {
+static Type *getMinimumFPType(Value *V, Type *PreferredTy, InstCombiner &IC) {
   if (auto *FPExt = dyn_cast<FPExtInst>(V))
     return FPExt->getOperand(0)->getType();
+  Value *Src;
+  if (match(V, m_IToFP(m_Value(Src))) &&
+      IC.canBeCastedExactlyIntToFP(Src, PreferredTy, isa<SIToFPInst>(V),
+                                   cast<Instruction>(V)))
+    return PreferredTy;
+
+  bool PreferBFloat = PreferredTy->getScalarType()->isBFloatTy();

Notice how we now call canBeCastedExactlyIntToFP with the Type of the fptrunc instruction passed in. Src in this case is the input to BO->getOperand(0), and BO is the 0th operand of the fptrunc instruction. Recall the LLVM IR from earlier:

text

%0 = urem i64 %mul, 74383
%conv2 = uitofp nneg i64 %0 to double
%div3 = fdiv double %conv2, 7.438300e+04
%conv4 = fptrunc double %div3 to float

The fptrunc converts its input into a float, so the type Ty is f32. BO is %div3, BO->getOperand(0) is %conv2, and m_IToFP(m_Value(Src)) binds the input of BO->getOperand(0) to Src. The table below shows the mapping of the visitFPTrunc code to the LLVM IR variables.

| visitFPTrunc Variables | LLVM IR Variables | Value |
| --- | --- | --- |
| BO | %div3 | fdiv double %conv2, 7.438300e+04 |
| BO->getOperand(0) | %conv2 | uitofp nneg i64 %0 to double |
| BO->getOperand(1) | 7.438300e+04 | double constant |
| Src | %0 | urem i64 %mul, 74383 |

## Result

Did this work? Running the command below gives us the LLVM IR after my patch.

text

$LLVMBUILDDIR/bin/clang -O3 \
  --target=riscv64-unknown-linux-gnu \
  -march=rv64gc_zba_zbb \
  --sysroot=/usr/riscv64-linux-gnu \
  -S -emit-llvm pi.c -o pifixed.ll

And this is the relevant LLVM IR.

text

%mul = mul nuw nsw i64 %ixran.053, 27611
%0 = urem i64 %mul, 74383
%1 = uitofp nneg i64 %0 to float
%conv4 = fdiv float %1, 7.438300e+04

It worked! 🥹

The fptrunc is gone, and the uitofp cast now converts the integer directly into a float.

Shown below are the results after my patch was merged. We can see that the target can execute the benchmark in 1.67 Bn cycles, about a 25% improvement.

https://cc-perf.igalia.com/dbdefault/v4/nts/profile/260/426/422 ↗
