Commit Graph

125 Commits

Author SHA1 Message Date
Rémi Denis-Courmont 449cab7b16 lavc/lpc: fix off-by-one in R-V V compute_autocorr
(cherry picked from commit af20fb9c4e)
Signed-off-by: Rémi Denis-Courmont <remi@remlab.net>
2024-05-28 19:47:04 +03:00
Rémi Denis-Courmont 2d514f5d48 lavc/flacdsp: do not assume maximum R-V VL
This loop correctly assumes that VLMAX=16 (4x128-bit vectors
with 32-bit elements) and 32 >= pred_order > 16. We need to alternate
between VL=16 and VL=t2=pred_order-16 elements to add up to pred_order.

The current code requests AVL=a2=pred_order elements. In QEMU and on
the K230 hardware, this sets VL=16 as we need. But the specification
merely guarantees that we get: ceil(AVL / 2) <= VL <= VLMAX. For
instance, if pred_order equals 27, we could end up with VL=14 or VL=15
instead of VL=16. So instead, literally request VLMAX=16.
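
As a hedged illustration (the checker and its names are ours, not part of
FFmpeg), the specification's constraints on the VL that a vsetvli request
may return can be written out as follows:

    #include <stdbool.h>

    /* RVV 1.0 rules: vl is exact for short tails, VLMAX when plenty of
     * work remains, and anywhere in [ceil(avl/2), vlmax] in between.
     * With avl = 27 and vlmax = 16, any of 14, 15 or 16 is legal. */
    static bool vl_is_legal(unsigned avl, unsigned vlmax, unsigned vl)
    {
        if (avl <= vlmax)
            return vl == avl;
        if (avl >= 2 * vlmax)
            return vl == vlmax;
        return vl >= (avl + 1) / 2 && vl <= vlmax;
    }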

(cherry picked from commit f883746587)
Signed-off-by: Rémi Denis-Courmont <remi@remlab.net>
2024-05-25 10:34:08 +03:00
Andreas Rheinhardt 88b3b09afa avcodec/aacenc: Move initializing DSP out of aacenc.c
Otherwise aacenc.o gets pulled in by the aacencdsp checkasm
test, and it in turn pulls in the rest of lavc.
Besides being bad size-wise, this also has the downside that
it pulls in avpriv_(cga|vga16)_font from libavutil, which are
marked as being imported from another library when building
libavcodec as a DLL; this breaks checkasm, because it links
both lavc and lavu statically.

Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
2024-03-02 02:54:11 +01:00
sunyuechi a7ad76fbbf lavc/me_cmp: R-V V nsse
C908:
nsse_0_c: 1990.0
nsse_0_rvv_i32: 572.0
nsse_1_c: 910.0
nsse_1_rvv_i32: 456.0

Signed-off-by: Rémi Denis-Courmont <remi@remlab.net>
2024-02-27 20:31:30 +02:00
sunyuechi 9b90d0d36a lavc/me_cmp: R-V V vsse vsad intra
C908:
vsad_4_c: 681.0
vsad_4_rvv_i32: 182.2
vsad_5_c: 278.0
vsad_5_rvv_i32: 145.2
vsse_4_c: 595.0
vsse_4_rvv_i32: 125.2
vsse_5_c: 281.0
vsse_5_rvv_i32: 101.2

Signed-off-by: Rémi Denis-Courmont <remi@remlab.net>
2024-02-25 11:05:25 +02:00
sunyuechi 925b55a5e8 lavc/me_cmp: R-V V vsse vsad
C908:
vsad_0_c: 936.0
vsad_0_rvv_i32: 236.2
vsad_1_c: 424.0
vsad_1_rvv_i32: 190.2
vsse_0_c: 877.0
vsse_0_rvv_i32: 204.2
vsse_1_c: 439.0
vsse_1_rvv_i32: 140.2

Signed-off-by: Rémi Denis-Courmont <remi@remlab.net>
2024-02-25 11:05:25 +02:00
sunyuechi 9cb8f262f2 lavc/me_cmp: R-V V sse
C908:
sse_0_c: 614.7
sse_0_rvv_i32: 138.2
sse_1_c: 302.7
sse_1_rvv_i32: 107.2
sse_2_c: 175.7
sse_2_rvv_i32: 104.2

Signed-off-by: Rémi Denis-Courmont <remi@remlab.net>
2024-02-21 20:08:55 +02:00
sunyuechi 37463d7979 lavc/me_cmp: R-V V pix_abs_y2
C908:
pix_abs_0_2_c: 904.0
pix_abs_0_2_rvv_i32: 172.2
pix_abs_1_2_c: 460.0
pix_abs_1_2_rvv_i32: 168.2

Signed-off-by: Rémi Denis-Courmont <remi@remlab.net>
2024-02-21 20:08:25 +02:00
sunyuechi f1ec475f66 lavc/me_cmp: R-V V pix_abs_x2
C908:
pix_abs_0_1_c: 767.0
pix_abs_0_1_rvv_i32: 196.2
pix_abs_1_1_c: 388.0
pix_abs_1_1_rvv_i32: 185.2

Signed-off-by: Rémi Denis-Courmont <remi@remlab.net>
2024-02-21 20:08:25 +02:00
sunyuechi b41e115dde lavc/me_cmp: R-V V pix_abs
C908:
pix_abs_0_0_c: 534.0
pix_abs_0_0_rvv_i32: 136.2
pix_abs_1_0_c: 287.7
pix_abs_1_0_rvv_i32: 125.2
sad_0_c: 534.0
sad_0_rvv_i32: 136.2
sad_1_c: 287.7
sad_1_rvv_i32: 125.2

Signed-off-by: Rémi Denis-Courmont <remi@remlab.net>
2024-02-21 20:08:25 +02:00
sunyuechi d897bbb48d lavc/vp8dsp: R-V V vp8_idct_dc_add4uv
c908:
vp8_idct_dc_add4uv_c: 387.7
vp8_idct_dc_add4uv_rvv_i32: 134.5

Signed-off-by: Rémi Denis-Courmont <remi@remlab.net>
2024-02-17 14:45:49 +02:00
sunyuechi e74e18cae4 lavc/vp8dsp: R-V V vp8_idct_dc_add4y
c908:
vp8_idct_dc_add4y_c: 368.5
vp8_idct_dc_add4y_rvv_i32: 134.5

Signed-off-by: Rémi Denis-Courmont <remi@remlab.net>
2024-02-17 14:45:49 +02:00
sunyuechi c12053cefc lavc/vp8dsp: R-V V vp8_idct_dc_add
c908:
vp8_idct_dc_add_c: 102.2
vp8_idct_dc_add_rvv_i32: 42.0

Signed-off-by: Rémi Denis-Courmont <remi@remlab.net>
2024-02-17 14:45:49 +02:00
sunyuechi 89189dd9e7 lavc/rv34dsp: R-V V rv34_idct_dc_add
C908:
rv34_idct_dc_add_c: 134.7
rv34_idct_dc_add_rvv_i32: 45.5

Signed-off-by: Rémi Denis-Courmont <remi@remlab.net>
2024-02-17 14:33:35 +02:00
sunyuechi ee08974f90 lavc/rv34dsp: R-V V rv34_inv_transform_dc
C908:
rv34_inv_transform_dc_c: 35.5
rv34_inv_transform_dc_rvv_i32: 27.0

Signed-off-by: Rémi Denis-Courmont <remi@remlab.net>
2024-02-17 14:33:35 +02:00
sunyuechi fdebde817c lavc/blockdsp: R-V V clear_blocks
C908:
blockdsp.clear_blocks_c: 128.2
blockdsp.clear_blocks_rvv_i64: 102.5

Signed-off-by: Rémi Denis-Courmont <remi@remlab.net>
2024-02-13 21:29:46 +02:00
sunyuechi 0748d2bbc7 lavc/blockdsp: R-V V clear_block
C908:
blockdsp.clear_block_c: 47.2
blockdsp.clear_block_rvv_i64: 28.5

Signed-off-by: Rémi Denis-Courmont <remi@remlab.net>
2024-02-12 22:00:03 +02:00
sunyuechi 8e23ebe6f9 lavc/svq1enc: R-V V ssd_int8_vs_int16
C908:
ssd_int8_vs_int16_c: 207.7
ssd_int8_vs_int16_rvv_i32: 14.2

Signed-off-by: Rémi Denis-Courmont <remi@remlab.net>
2024-01-17 17:49:54 +02:00
Rémi Denis-Courmont 278b4b60d6 lavc/takdsp: R-V V decorrelate_sf
decorrelate_sf_c:      259.2
decorrelate_sf_rvv_i32: 45.5
2024-01-15 19:00:25 +02:00
sunyuechi 3d39b8d4e7 lavc/takdsp: R-V V decorrelate_sm
C908:
decorrelate_sm_c: 130.0
decorrelate_sm_rvv_i32: 43.2

Signed-off-by: Rémi Denis-Courmont <remi@remlab.net>
(with minor changes)
2023-12-22 17:40:00 +02:00
James Almer 46775e64f8 avcodec/takdsp: fix const correctness
Signed-off-by: James Almer <jamrial@gmail.com>
2023-12-22 09:28:04 -03:00
sunyuechi c933ff2779 lavc/takdsp: R-V V decorrelate_sr
C908:
decorrelate_sr_c: 95.5
decorrelate_sr_rvv_i32: 28.2

Signed-off-by: Rémi Denis-Courmont <remi@remlab.net>
2023-12-21 22:42:34 +02:00
sunyuechi 864174dd00 lavc/takdsp: R-V V decorrelate_ls
C908:
decorrelate_ls_c: 69.7
decorrelate_ls_rvv_i32: 27.2

Signed-off-by: Rémi Denis-Courmont <remi@remlab.net>
2023-12-21 22:42:34 +02:00
Rémi Denis-Courmont cdd38a2ffe lavc/aacpsdsp: fix R-V V stereo interpolate
The penultimate loop iteration could pick any vl such that:
 vlenb/4 < vl <= vlenb/2
Thus if the total length is not a multiple of vlenb/2, the vfadd.vf
on the penultimate iteration would yield corrupt values for the last
iteration.

To avoid this, force vl = vlenb/2 until the last iteration. Unfortunately
this latent bug is not reproducible with either hardware or QEMU as of now.
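
A scalar sketch of the stripmining pattern the fix enforces (identifiers
ours, not the actual assembly); the vector code precomputes per-lane
coefficient steps for a full vlenb/2-element chunk, so every iteration
except the last must use exactly that many elements:

    #include <stddef.h>

    static void interpolate(float *x, size_t len, const float *coef,
                            size_t full)
    {
        while (len > 0) {
            /* allow a partial chunk only at the very end */
            size_t vl = len > full ? full : len;
            for (size_t i = 0; i < vl; i++)
                x[i] *= coef[i]; /* coef laid out per lane */
            x += vl;
            len -= vl;
        }
    }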
2023-12-21 17:54:23 +02:00
Rémi Denis-Courmont db32f75c63 lavc/opusdsp: simplify R-V V postfilter
This skips the round-trip to a scalar register for the sliding 'x'
coefficients, improving performance by about 5%. The trick here is that
the vector slide-up instruction preserves the elements of the destination
vector below the slide offset.

The switch from vfslide1up.vf to vslideup.vi also allows the elimination
of data dependencies on consecutive slides. Since the specifications
recommend sticking to power-of-two offsets, we could slide as follows:

        vslideup.vi v8, v0, 2
        vslideup.vi v4, v0, 1
        vslideup.vi v12, v8, 1
        vslideup.vi v16, v8, 2

However, on the device under test, this seems to make performance slightly
worse, so this is left for (in)validation with future better hardware.
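
For reference, the preserved-prefix behaviour that the trick relies on
(values illustrative):

    /* vslideup.vi vd, vs, 2 with vl = 4:
     *   vs = {a, b, c, d}
     *   vd = {w, x, y, z}   before
     *   vd = {w, x, a, b}   after: elements below the offset are kept */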
2023-12-21 17:54:08 +02:00
Rémi Denis-Courmont 419145c11b lavc/vc1dsp: fix R-V V vector lengths
The 8x4 and 4x4 cases use a needlessly large multiplier (unless/until we care
about embedded 64-bit-vector hardware). This is merely suboptimal.

The 8x4 case also uses an incorrect vector length, which leads to incorrect
behaviour on future/hypothetical hardware with 256-bit or larger vectors.

Pointed-out-by: Martin Storsjö <martin@martin.st>
2023-12-17 09:27:52 +02:00
Martin Storsjö b51d9eb58e riscv: vc1dsp: Don't check vlenb before checking the CPU flags
We can't call ff_get_rv_vlenb() if we don't have RVV available
at all.

Acked-by: Rémi Denis-Courmont <remi@remlab.net>
Signed-off-by: Martin Storsjö <martin@martin.st>
2023-12-16 22:30:26 +02:00
Rémi Denis-Courmont 918b3ed2d5 lavc/lpc: R-V V compute_autocorr
The loop iterates over the length of the vector, not the order. This is
to avoid reloading the same data for each lag value. However this means
the loop only works if the maximum order is no larger than VLENB.

The loop is roughly equivalent to:

    for (size_t j = 0; j < lag; j++)
        autoc[j] = 1.;

    while (len > lag) {
        for (ptrdiff_t j = 0; j < lag; j++)
            autoc[j] += data[j] * *data;
        data++;
        len--;
    }

    while (len > 0) {
        for (ptrdiff_t j = 0; j < len; j++)
            autoc[j] += data[j] * *data;
        data++;
        len--;
    }

Since register pressure is only at 50%, it should be possible to implement
the same loop for order up to 2xVLENB. But this is left for future work.

Performance numbers are all over the place, from ~1.25x to ~4x speedups,
but at least they are always noticeably better than nothing.
2023-12-16 11:18:01 +02:00
sunyuechi 98596f90f4 lavc/aacencdsp: R-V V abs_pow34
C908:
abs_pow34_c: 535.5
abs_pow34_rvv_f32: 337.2

Signed-off-by: Rémi Denis-Courmont <remi@remlab.net>
2023-12-11 18:42:07 +02:00
Rémi Denis-Courmont 272d0c164d lavc/lpc: R-V V apply_welch_window
apply_welch_window_even_c:       617.5
apply_welch_window_even_rvv_f64: 235.0
apply_welch_window_odd_c:        709.0
apply_welch_window_odd_rvv_f64:  256.5
2023-12-11 18:17:43 +02:00
Rémi Denis-Courmont b3825bbe45 riscv: test for assembler support
This should fix the build on LLVM 16 and earlier, at the cost of turning
all non-RVV optimisations off.
2023-12-08 17:21:09 +02:00
sunyuechi 0b9d009b4a lavc/vc1dsp: R-V V inv_trans
C908:
vc1dsp.vc1_inv_trans_4x4_dc_c:      125.7
vc1dsp.vc1_inv_trans_4x4_dc_rvv_i32: 53.5
vc1dsp.vc1_inv_trans_4x8_dc_c:      230.7
vc1dsp.vc1_inv_trans_4x8_dc_rvv_i32: 65.5
vc1dsp.vc1_inv_trans_8x4_dc_c:      228.7
vc1dsp.vc1_inv_trans_8x4_dc_rvv_i64: 64.5
vc1dsp.vc1_inv_trans_8x8_dc_c:      476.5
vc1dsp.vc1_inv_trans_8x8_dc_rvv_i64: 80.2

Signed-off-by: Rémi Denis-Courmont <remi@remlab.net>
2023-12-08 17:20:48 +02:00
sunyuechi 8bdb663062 lavc/ac3dsp: R-V V float_to_fixed24
C910:
float_to_fixed24_c: 2207.2
float_to_fixed24_rvv_f32: 696.2

Signed-off-by: Rémi Denis-Courmont <remi@remlab.net>
2023-12-06 16:04:22 +02:00
Rémi Denis-Courmont 0fa421c8f1 lavc/llvidencdsp: add R-V V diff_bytes
diff_bytes_c:      163.0
diff_bytes_rvv_i32: 52.7
2023-11-23 18:57:18 +02:00
Rémi Denis-Courmont 0183c2c830 lavc/aacpsdsp: use LMUL=2 and amortise strides
The input is laid out in 16 segments, of which 13 actually need to be
loaded. There are no really efficient ways to deal with this:
1) If we load 8 segments with unit stride, then narrow to 16 segments with
   right shifts, we can only get one half-size vector per segment, or just 2
   elements per vector (EMUL=1/2) - at least with 128-bit vectors.
   This ends up, unsurprisingly, about as fast as the C code.
2) The current approach is to load with strides. We keep that approach,
   but improve it using three 4-segmented loads instead of 12 single-segment
   loads. This divides the number of distinct loaded addresses by 4.
3) A potential third approach would be to avoid segmentation altogether
   and splat the scalar coefficient into vectors. Then we can use a
   unit-stride and maximum EMUL. But the downside then is that we have to
   multiply the 3 (of 16) unused segments with zero as part of the
   multiply-accumulate operations.

In addition, we reuse vectors mid-loop so as to increase the EMUL
from 1 to 2, which also improves performance a little.

Overall the gains are quite small on the device under test, as it does
not handle segmented loads very well. But at least the code is tidier,
and it should enjoy bigger speed-ups on better hardware implementations.
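
A scalar emulation of one such 4-segment strided load may make the
addressing clearer (layout and names ours): each lane reads four
consecutive segments, so three calls with seg0 = 0, 4 and 8 replace
twelve single-segment loads.

    #include <stddef.h>

    static void load_seg4(float dst[][4], const float *base,
                          size_t stride, size_t seg0, size_t lanes)
    {
        for (size_t i = 0; i < lanes; i++)
            for (size_t s = 0; s < 4; s++)
                dst[i][s] = base[i * stride + seg0 + s];
    }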

ps_hybrid_analysis_c:       1819.2
ps_hybrid_analysis_rvv_f32: 1037.0 (before)
ps_hybrid_analysis_rvv_f32:  990.0 (after)
2023-11-23 18:57:18 +02:00
Rémi Denis-Courmont b88d4058f9 lavc/g722dsp: optimise R-V V apply_qmf
This stores the constant coefficients deinterleaved, so that they can be
loaded directly with NF=0. Unfortunately, we cannot optimise loading the
input, due to insufficient memory alignment (not 32-bit).
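
A sketch of the init-time deinterleaving (identifiers ours): once the
interleaved taps {e0, o0, e1, o1, ...} are split into two contiguous
banks, each bank can be fetched with a plain unit-stride vector load.

    #include <stddef.h>

    static void deinterleave_taps(int *even, int *odd,
                                  const int *taps, size_t n)
    {
        for (size_t i = 0; i < n; i++) {
            even[i] = taps[2 * i];
            odd[i]  = taps[2 * i + 1];
        }
    }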

Before:
g722_apply_qmf_c:       82.5
g722_apply_qmf_rvv_i32: 78.2

After:
g722_apply_qmf_c:       82.5
g722_apply_qmf_rvv_i32: 65.2
2023-11-23 18:57:18 +02:00
Rémi Denis-Courmont fbc7adba67 lavc/llviddsp: R-V V add_bytes
add_bytes_c:      2077.2
add_bytes_rvv_i32: 105.0
2023-11-18 22:07:14 +02:00
Rémi Denis-Courmont ca664f2254 lavc/flacdsp: R-V V LPC16 function
In this case, the inner loop computing the scalar product can be reduced
to just one multiplication and one sum even with 128-bit vectors. The
result is a lot simpler, but also brings more modest performance gains:

flac_lpc_16_13_c:       15241.0
flac_lpc_16_13_rvv_i32: 11230.0
flac_lpc_16_16_c:       17884.0
flac_lpc_16_16_rvv_i32: 12125.7
flac_lpc_16_29_c:       27847.7
flac_lpc_16_29_rvv_i32: 10494.0
flac_lpc_16_32_c:       30051.5
flac_lpc_16_32_rvv_i32: 10355.0
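
For context, a hedged C reference of the FLAC predictor being vectorised
(argument names and tap order ours); the inner dot product is what the
vector loop reduces to a single widening multiply and sum:

    #include <stdint.h>

    static void lpc16_ref(int32_t *decoded, const int32_t coeffs[32],
                          int pred_order, int qlevel, int len)
    {
        for (int i = pred_order; i < len; i++) {
            int64_t sum = 0;
            for (int j = 0; j < pred_order; j++)
                sum += (int64_t)coeffs[j] * decoded[i - pred_order + j];
            decoded[i] += (int32_t)(sum >> qlevel);
        }
    }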
2023-11-18 22:06:57 +02:00
Rémi Denis-Courmont 295092b46d lavc/flacdsp: R-V V LPC32
The entire set of 32 coefficients and the corresponding past 32 samples
can fit in a single vector register group (with LMUL=8) exactly, but...
since widening doubles the needed vector sizes, we still end up too short
with 128-bit vectors. This adds a very simple version for future 256+-bit
hardware and for pred_order values up to 16, and a somewhat more involved
loop for 128-bit hardware with pred_order between 17 and 32.
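
The size arithmetic, spelled out for VLEN=128:

    /* 32 coefficients x 32 bits = 1024 bits = 8 x 128-bit registers,
     * i.e. exactly one LMUL=8 register group.  Widening the products
     * to 64 bits would need the equivalent of LMUL=16, which does not
     * exist - hence the separate 17..32 path for 128-bit hardware. */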

With 128-bit hardware, the benchmarks look like this:
flac_lpc_32_13_c:       30152.0
flac_lpc_32_13_rvv_i32: 10244.7
flac_lpc_32_16_c:       37314.2
flac_lpc_32_16_rvv_i32: 10126.2
flac_lpc_32_29_c:       61910.0
flac_lpc_32_29_rvv_i32: 14495.2
flac_lpc_32_32_c:       68204.0
flac_lpc_32_32_rvv_i32: 13273.7
2023-11-18 22:05:43 +02:00
Rémi Denis-Courmont 07c303b708 lavc/flacdsp: R-V V decorrelate_indep 16-bit packed
flac_decorrelate_indep2_16_c:        981.7
flac_decorrelate_indep2_16_rvv_i32:  199.2
flac_decorrelate_indep4_16_c:       1749.7
flac_decorrelate_indep4_16_rvv_i32:  401.2
flac_decorrelate_indep6_16_c:       2517.7
flac_decorrelate_indep6_16_rvv_i32:  858.0
flac_decorrelate_indep8_16_c:       3285.7
flac_decorrelate_indep8_16_rvv_i32: 1123.5
2023-11-17 23:59:56 +02:00
Rémi Denis-Courmont fb0295e5fd lavc/flacdsp: R-V V decorrelate_indep 32-bit packed
flac_decorrelate_indep2_32_c:       981.7
flac_decorrelate_indep2_32_rvv_i32: 183.7
flac_decorrelate_indep4_32_c:      1749.7
flac_decorrelate_indep4_32_rvv_i32: 362.5
flac_decorrelate_indep6_32_c:      2517.7
flac_decorrelate_indep6_32_rvv_i32: 715.2
flac_decorrelate_indep8_32_c:      3285.7
flac_decorrelate_indep8_32_rvv_i32: 909.0
2023-11-17 23:59:56 +02:00
Rémi Denis-Courmont 6183a69c0b lavc/flacdsp: R-V V decorrelate_ms packed
flac_decorrelate_ms_16_c:       585.5
flac_decorrelate_ms_16_rvv_i32: 263.0
flac_decorrelate_ms_32_c:       584.7
flac_decorrelate_ms_32_rvv_i32: 250.0
2023-11-17 23:59:23 +02:00
Rémi Denis-Courmont 636ae0e0bc lavc/flacdsp: R-V V packed decorrelate_{l,r}s
flac_decorrelate_ls_16_c:       457.2
flac_decorrelate_ls_16_rvv_i32: 203.0
flac_decorrelate_ls_32_c:       457.2
flac_decorrelate_ls_32_rvv_i32: 203.5
flac_decorrelate_rs_16_c:       456.2
flac_decorrelate_rs_16_rvv_i32: 207.0
flac_decorrelate_rs_32_c:       456.2
flac_decorrelate_rs_32_rvv_i32: 210.5
2023-11-17 23:59:22 +02:00
Rémi Denis-Courmont d076517056 lavc/llauddsp: R-V V scalarproduct_and_madd_int32
scalarproduct_and_madd_int32_c:      10899.7
scalarproduct_and_madd_int32_rvv_i32: 1749.0
2023-11-16 16:53:44 +02:00
Rémi Denis-Courmont 45d0eb3f70 lavc/llauddsp: R-V V scalarproduct_and_madd_int16
scalarproduct_and_madd_int16_c:      10355.7
scalarproduct_and_madd_int16_rvv_i32: 1480.0
2023-11-16 16:53:44 +02:00
Rémi Denis-Courmont 90a779bed6 lavc/huffyuvdsp: basic R-V V add_hfyu_left_pred_bgr32
Better performance can probably be achieved with a more intricate
unrolled loop, but this is a start:

add_hfyu_left_pred_bgr32_c: 15084.0
add_hfyu_left_pred_bgr32_rvv_i32: 10280.2

This would actually be cleaner with the RISC-V P extension, but that is
not ratified yet (I think?) and usually not supported if V is supported.
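
A hedged scalar reference of the operation (names ours): each byte of a
packed 32-bit pixel accumulates onto the matching byte of the pixel to
its left, with 8-bit wraparound. The serial dependency between pixels is
what makes this hard to vectorise well.

    #include <stdint.h>

    static void left_pred_bgr32_ref(uint8_t *dst, const uint8_t *src,
                                    int w, uint8_t left[4])
    {
        for (int i = 0; i < w; i++)
            for (int c = 0; c < 4; c++) {
                left[c] += src[4 * i + c];
                dst[4 * i + c] = left[c];
            }
    }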
2023-11-15 16:51:07 +02:00
Rémi Denis-Courmont c536e92207 lavc/sbrdsp: R-V V hf_apply_noise functions
This is restricted to 128-bit vectors as larger vector sizes could read
past the end of the noise array. Support for future hardware with larger
vector sizes is left for some other time.

hf_apply_noise_0_c:       2319.7
hf_apply_noise_0_rvv_f32: 1229.0
hf_apply_noise_1_c:       2539.0
hf_apply_noise_1_rvv_f32: 1244.7
hf_apply_noise_2_c:       2319.7
hf_apply_noise_2_rvv_f32: 1232.7
hf_apply_noise_3_c:       2541.2
hf_apply_noise_3_rvv_f32: 1244.2
2023-11-13 18:34:29 +02:00
Rémi Denis-Courmont 5b33104fca lavc/sbrdsp: R-V V hf_gen
hf_gen_c:      2922.7
hf_gen_rvv_f32: 731.5
2023-11-13 18:33:02 +02:00
Rémi Denis-Courmont cd7b352c53 lavc/sbrdsp: R-V V autocorrelate
With 5 accumulator vectors and 6 inputs, this can only use LMUL=2.
Also the number of vector loop iterations is small, just 5 on 128-bit
vector hardware.

The vector loop is somewhat unusual in that it processes data in
descending memory order, in order to save on vector slides:
in descending order, we can extract elements to carry over to the next
iteration from the bottom of the vectors directly. With ascending order
(see the Opus postfilter function), there is no way to get the top
elements directly. On the downside, this requires the use of separate
shift and sub (the would-be SH3SUB instruction does not exist), with
a small pipeline stall on the vector load address.

The edge cases are done in scalar code, as this saves on loads
and remains significantly faster than C.

autocorrelate_c: 669.2
autocorrelate_rvv_f32: 421.0
2023-11-12 14:03:09 +02:00
Rémi Denis-Courmont f576a0835b lavc/aacpsdsp: rework R-V V hybrid_synthesis_deint
Given the size of the data set, strided memory accesses cannot be avoided.
We can still do better than the current code.

ps_hybrid_synthesis_deint_c:       12065.5
ps_hybrid_synthesis_deint_rvv_i32: 13650.2 (before)
ps_hybrid_synthesis_deint_rvv_i64:  8181.0 (after)
2023-11-12 14:03:09 +02:00