Commit Graph

125 Commits

Author SHA1 Message Date
Rémi Denis-Courmont 449cab7b16 lavc/lpc: fix off-by-one in R-V V compute_autocorr
(cherry picked from commit af20fb9c4e)
Signed-off-by: Rémi Denis-Courmont <remi@remlab.net>
2024-05-28 19:47:04 +03:00
Rémi Denis-Courmont 2d514f5d48 lavc/flacdsp: do not assume maximum R-V VL
This loop correctly assumes that VLMAX=16 (4x128-bit vectors
with 32-bit elements) and 32 >= pred_order > 16. We need to alternate
between VL=16 and VL=t2=pred_order-16 elements to add up to pred_order.

The current code requests AVL=a2=pred_order elements. In QEMU and on
the K230 hardware, this sets VL=16 as we need. But the specification
merely guarantees that we get: ceil(AVL / 2) <= VL <= VLMAX. For
instance, if pred_order equals 27, we could end up with VL=14 or VL=15
instead of VL=16. So instead, literally request VLMAX=16.
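
As a hedged illustration (the checker and its names are ours, not part of
FFmpeg), the specification's constraints on the VL that a vsetvli request
may return can be written out as follows:

    #include <stdbool.h>

    /* RVV 1.0 rules: vl is exact for short tails, VLMAX when plenty of
     * work remains, and anywhere in [ceil(avl/2), vlmax] in between.
     * With avl = 27 and vlmax = 16, any of 14, 15 or 16 is legal. */
    static bool vl_is_legal(unsigned avl, unsigned vlmax, unsigned vl)
    {
        if (avl <= vlmax)
            return vl == avl;
        if (avl >= 2 * vlmax)
            return vl == vlmax;
        return vl >= (avl + 1) / 2 && vl <= vlmax;
    }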

(cherry picked from commit f883746587)
Signed-off-by: Rémi Denis-Courmont <remi@remlab.net>
2024-05-25 10:34:08 +03:00
Andreas Rheinhardt 88b3b09afa avcodec/aacenc: Move initializing DSP out of aacenc.c
Otherwise aacenc.o gets pulled in by the aacencdsp checkasm
test, and it in turn pulls in the rest of lavc.
Besides being bad size-wise, this also has the downside that
it pulls in avpriv_(cga|vga16)_font from libavutil, which are
marked as being imported from another library when building
libavcodec as a DLL; this breaks checkasm, because it links
both lavc and lavu statically.

Signed-off-by: Andreas Rheinhardt <andreas.rheinhardt@outlook.com>
2024-03-02 02:54:11 +01:00
sunyuechi a7ad76fbbf lavc/me_cmp: R-V V nsse
C908:
nsse_0_c: 1990.0
nsse_0_rvv_i32: 572.0
nsse_1_c: 910.0
nsse_1_rvv_i32: 456.0

Signed-off-by: Rémi Denis-Courmont <remi@remlab.net>
2024-02-27 20:31:30 +02:00
sunyuechi 9b90d0d36a lavc/me_cmp: R-V V vsse vsad intra
C908:
vsad_4_c: 681.0
vsad_4_rvv_i32: 182.2
vsad_5_c: 278.0
vsad_5_rvv_i32: 145.2
vsse_4_c: 595.0
vsse_4_rvv_i32: 125.2
vsse_5_c: 281.0
vsse_5_rvv_i32: 101.2

Signed-off-by: Rémi Denis-Courmont <remi@remlab.net>
2024-02-25 11:05:25 +02:00
sunyuechi 925b55a5e8 lavc/me_cmp: R-V V vsse vsad
C908:
vsad_0_c: 936.0
vsad_0_rvv_i32: 236.2
vsad_1_c: 424.0
vsad_1_rvv_i32: 190.2
vsse_0_c: 877.0
vsse_0_rvv_i32: 204.2
vsse_1_c: 439.0
vsse_1_rvv_i32: 140.2

Signed-off-by: Rémi Denis-Courmont <remi@remlab.net>
2024-02-25 11:05:25 +02:00
sunyuechi 9cb8f262f2 lavc/me_cmp: R-V V sse
C908:
sse_0_c: 614.7
sse_0_rvv_i32: 138.2
sse_1_c: 302.7
sse_1_rvv_i32: 107.2
sse_2_c: 175.7
sse_2_rvv_i32: 104.2

Signed-off-by: Rémi Denis-Courmont <remi@remlab.net>
2024-02-21 20:08:55 +02:00
sunyuechi 37463d7979 lavc/me_cmp: R-V V pix_abs_y2
C908:
pix_abs_0_2_c: 904.0
pix_abs_0_2_rvv_i32: 172.2
pix_abs_1_2_c: 460.0
pix_abs_1_2_rvv_i32: 168.2

Signed-off-by: Rémi Denis-Courmont <remi@remlab.net>
2024-02-21 20:08:25 +02:00
sunyuechi f1ec475f66 lavc/me_cmp: R-V V pix_abs_x2
C908:
pix_abs_0_1_c: 767.0
pix_abs_0_1_rvv_i32: 196.2
pix_abs_1_1_c: 388.0
pix_abs_1_1_rvv_i32: 185.2

Signed-off-by: Rémi Denis-Courmont <remi@remlab.net>
2024-02-21 20:08:25 +02:00
sunyuechi b41e115dde lavc/me_cmp: R-V V pix_abs
C908:
pix_abs_0_0_c: 534.0
pix_abs_0_0_rvv_i32: 136.2
pix_abs_1_0_c: 287.7
pix_abs_1_0_rvv_i32: 125.2
sad_0_c: 534.0
sad_0_rvv_i32: 136.2
sad_1_c: 287.7
sad_1_rvv_i32: 125.2

Signed-off-by: Rémi Denis-Courmont <remi@remlab.net>
2024-02-21 20:08:25 +02:00
sunyuechi d897bbb48d lavc/vp8dsp: R-V V vp8_idct_dc_add4uv
c908:
vp8_idct_dc_add4uv_c: 387.7
vp8_idct_dc_add4uv_rvv_i32: 134.5

Signed-off-by: Rémi Denis-Courmont <remi@remlab.net>
2024-02-17 14:45:49 +02:00
sunyuechi e74e18cae4 lavc/vp8dsp: R-V V vp8_idct_dc_add4y
c908:
vp8_idct_dc_add4y_c: 368.5
vp8_idct_dc_add4y_rvv_i32: 134.5

Signed-off-by: Rémi Denis-Courmont <remi@remlab.net>
2024-02-17 14:45:49 +02:00
sunyuechi c12053cefc lavc/vp8dsp: R-V V vp8_idct_dc_add
c908:
vp8_idct_dc_add_c: 102.2
vp8_idct_dc_add_rvv_i32: 42.0

Signed-off-by: Rémi Denis-Courmont <remi@remlab.net>
2024-02-17 14:45:49 +02:00
sunyuechi 89189dd9e7 lavc/rv34dsp: R-V V rv34_idct_dc_add
C908:
rv34_idct_dc_add_c: 134.7
rv34_idct_dc_add_rvv_i32: 45.5

Signed-off-by: Rémi Denis-Courmont <remi@remlab.net>
2024-02-17 14:33:35 +02:00
sunyuechi ee08974f90 lavc/rv34dsp: R-V V rv34_inv_transform_dc
C908:
rv34_inv_transform_dc_c: 35.5
rv34_inv_transform_dc_rvv_i32: 27.0

Signed-off-by: Rémi Denis-Courmont <remi@remlab.net>
2024-02-17 14:33:35 +02:00
sunyuechi fdebde817c lavc/blockdsp: R-V V clear_blocks
C908:
blockdsp.clear_blocks_c: 128.2
blockdsp.clear_blocks_rvv_i64: 102.5

Signed-off-by: Rémi Denis-Courmont <remi@remlab.net>
2024-02-13 21:29:46 +02:00
sunyuechi 0748d2bbc7 lavc/blockdsp: R-V V clear_block
C908:
blockdsp.clear_block_c: 47.2
blockdsp.clear_block_rvv_i64: 28.5

Signed-off-by: Rémi Denis-Courmont <remi@remlab.net>
2024-02-12 22:00:03 +02:00
sunyuechi 8e23ebe6f9 lavc/svq1enc: R-V V ssd_int8_vs_int16
C908:
ssd_int8_vs_int16_c: 207.7
ssd_int8_vs_int16_rvv_i32: 14.2

Signed-off-by: Rémi Denis-Courmont <remi@remlab.net>
2024-01-17 17:49:54 +02:00
Rémi Denis-Courmont 278b4b60d6 lavc/takdsp: R-V V decorrelate_sf
decorrelate_sf_c:      259.2
decorrelate_sf_rvv_i32: 45.5
2024-01-15 19:00:25 +02:00
sunyuechi 3d39b8d4e7 lavc/takdsp: R-V V decorrelate_sm
C908:
decorrelate_sm_c: 130.0
decorrelate_sm_rvv_i32: 43.2

Signed-off-by: Rémi Denis-Courmont <remi@remlab.net>
(with minor changes)
2023-12-22 17:40:00 +02:00
James Almer 46775e64f8 avcodec/takdsp: fix const correctness
Signed-off-by: James Almer <jamrial@gmail.com>
2023-12-22 09:28:04 -03:00
sunyuechi c933ff2779 lavc/takdsp: R-V V decorrelate_sr
C908:
decorrelate_sr_c: 95.5
decorrelate_sr_rvv_i32: 28.2

Signed-off-by: Rémi Denis-Courmont <remi@remlab.net>
2023-12-21 22:42:34 +02:00
sunyuechi 864174dd00 lavc/takdsp: R-V V decorrelate_ls
C908:
decorrelate_ls_c: 69.7
decorrelate_ls_rvv_i32: 27.2

Signed-off-by: Rémi Denis-Courmont <remi@remlab.net>
2023-12-21 22:42:34 +02:00
Rémi Denis-Courmont cdd38a2ffe lavc/aacpsdsp: fix R-V V stereo interpolate
The penultimate loop iteration could pick any vl such that:
 vlenb/4 < vl <= vlenb/2
Thus if the total length is not a multiple of vlenb/2, the vfadd.vf
on the penultimate iteration would yield corrupt values for the last
iteration.

To avoid this, force vl = vlenb/2 until the last iteration. Unfortunately
this latent bug is not reproducible with either hardware or QEMU as of now.
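
A scalar sketch of the stripmining pattern the fix enforces (identifiers
ours, not the actual assembly); the vector code precomputes per-lane
coefficient steps for a full vlenb/2-element chunk, so every iteration
except the last must use exactly that many elements:

    #include <stddef.h>

    static void interpolate(float *x, size_t len, const float *coef,
                            size_t full)
    {
        while (len > 0) {
            /* allow a partial chunk only at the very end */
            size_t vl = len > full ? full : len;
            for (size_t i = 0; i < vl; i++)
                x[i] *= coef[i]; /* coef laid out per lane */
            x += vl;
            len -= vl;
        }
    }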
2023-12-21 17:54:23 +02:00
Rémi Denis-Courmont db32f75c63 lavc/opusdsp: simplify R-V V postfilter
This skips the round-trip to a scalar register for the sliding 'x'
coefficients, improving performance by about 5%. The trick here is that
the vector slide-up instruction preserves the elements of the destination
vector below the slide offset.

The switch from vfslide1up.vf to vslideup.vi also allows the elimination
of data dependencies on consecutive slides. Since the specifications
recommend sticking to power-of-two offsets, we could slide as follows:

        vslideup.vi v8, v0, 2
        vslideup.vi v4, v0, 1
        vslideup.vi v12, v8, 1
        vslideup.vi v16, v8, 2

However, on the device under test, this seems to make performance slightly
worse, so this is left for (in)validation with future better hardware.
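
For reference, the preserved-prefix behaviour that the trick relies on
(values illustrative):

    /* vslideup.vi vd, vs, 2 with vl = 4:
     *   vs = {a, b, c, d}
     *   vd = {w, x, y, z}   before
     *   vd = {w, x, a, b}   after: elements below the offset are kept */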
2023-12-21 17:54:08 +02:00
Rémi Denis-Courmont 419145c11b lavc/vc1dsp: fix R-V V vector lengths
The 8x4 and 4x4 cases use a needlessly large multiplier (unless/until we care
about embedded 64-bit-vector hardware). This is merely suboptimal.

The 8x4 case also uses an incorrect vector length, which leads to incorrect
behaviour on future/hypothetical hardware with 256-bit or larger vectors.

Pointed-out-by: Martin Storsjö <martin@martin.st>
2023-12-17 09:27:52 +02:00
Martin Storsjö b51d9eb58e riscv: vc1dsp: Don't check vlenb before checking the CPU flags
We can't call ff_get_rv_vlenb() if we don't have RVV available
at all.

Acked-by: Rémi Denis-Courmont <remi@remlab.net>
Signed-off-by: Martin Storsjö <martin@martin.st>
2023-12-16 22:30:26 +02:00
Rémi Denis-Courmont 918b3ed2d5 lavc/lpc: R-V V compute_autocorr
The loop iterates over the length of the vector, not the order. This is
to avoid reloading the same data for each lag value. However this means
the loop only works if the maximum order is no larger than VLENB.

The loop is roughly equivalent to:

    for (size_t j = 0; j < lag; j++)
        autoc[j] = 1.;

    while (len > lag) {
        for (ptrdiff_t j = 0; j < lag; j++)
            autoc[j] += data[j] * *data;
        data++;
        len--;
    }

    while (len > 0) {
        for (ptrdiff_t j = 0; j < len; j++)
            autoc[j] += data[j] * *data;
        data++;
        len--;
    }

Since register pressure is only at 50%, it should be possible to implement
the same loop for order up to 2xVLENB. But this is left for future work.

Performance numbers are all over the place, from ~1.25x to ~4x speedups,
but at least they are always noticeably better than nothing.
2023-12-16 11:18:01 +02:00
sunyuechi 98596f90f4 lavc/aacencdsp: R-V V abs_pow34
C908:
abs_pow34_c: 535.5
abs_pow34_rvv_f32: 337.2

Signed-off-by: Rémi Denis-Courmont <remi@remlab.net>
2023-12-11 18:42:07 +02:00
Rémi Denis-Courmont 272d0c164d lavc/lpc: R-V V apply_welch_window
apply_welch_window_even_c:       617.5
apply_welch_window_even_rvv_f64: 235.0
apply_welch_window_odd_c:        709.0
apply_welch_window_odd_rvv_f64:  256.5
2023-12-11 18:17:43 +02:00
Rémi Denis-Courmont b3825bbe45 riscv: test for assembler support
This should fix the build on LLVM 16 and earlier, at the cost of turning
all non-RVV optimisations off.
2023-12-08 17:21:09 +02:00
sunyuechi 0b9d009b4a lavc/vc1dsp: R-V V inv_trans
C908:
vc1dsp.vc1_inv_trans_4x4_dc_c:      125.7
vc1dsp.vc1_inv_trans_4x4_dc_rvv_i32: 53.5
vc1dsp.vc1_inv_trans_4x8_dc_c:      230.7
vc1dsp.vc1_inv_trans_4x8_dc_rvv_i32: 65.5
vc1dsp.vc1_inv_trans_8x4_dc_c:      228.7
vc1dsp.vc1_inv_trans_8x4_dc_rvv_i64: 64.5
vc1dsp.vc1_inv_trans_8x8_dc_c:      476.5
vc1dsp.vc1_inv_trans_8x8_dc_rvv_i64: 80.2

Signed-off-by: Rémi Denis-Courmont <remi@remlab.net>
2023-12-08 17:20:48 +02:00
sunyuechi 8bdb663062 lavc/ac3dsp: R-V V float_to_fixed24
C910:
float_to_fixed24_c: 2207.2
float_to_fixed24_rvv_f32: 696.2

Signed-off-by: Rémi Denis-Courmont <remi@remlab.net>
2023-12-06 16:04:22 +02:00
Rémi Denis-Courmont 0fa421c8f1 lavc/llvidencdsp: add R-V V diff_bytes
diff_bytes_c:      163.0
diff_bytes_rvv_i32: 52.7
2023-11-23 18:57:18 +02:00
Rémi Denis-Courmont 0183c2c830 lavc/aacpsdsp: use LMUL=2 and amortise strides
The input is laid out in 16 segments, of which 13 actually need to be
loaded. There are no really efficient ways to deal with this:
1) If we load 8 segments with unit stride, then narrow to 16 segments with
   right shifts, we can only get one half-size vector per segment, or just 2
   elements per vector (EMUL=1/2) - at least with 128-bit vectors.
   This ends up, unsurprisingly, about as fast as the C code.
2) The current approach is to load with strides. We keep that approach,
   but improve it using three 4-segmented loads instead of 12 single-segment
   loads. This divides the number of distinct loaded addresses by 4.
3) A potential third approach would be to avoid segmentation altogether
   and splat the scalar coefficient into vectors. Then we can use a
   unit-stride and maximum EMUL. But the downside then is that we have to
   multiply the 3 (of 16) unused segments with zero as part of the
   multiply-accumulate operations.

In addition, we reuse vectors mid-loop so as to increase the EMUL
from 1 to 2, which also improves performance a little.

Overall the gains are quite small on the device under test, as it does
not handle segmented loads very well. But at least the code is tidier,
and it should enjoy bigger speed-ups on better hardware implementations.
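
A scalar emulation of one such 4-segment strided load may make the
addressing clearer (layout and names ours): each lane reads four
consecutive segments, so three calls with seg0 = 0, 4 and 8 replace
twelve single-segment loads.

    #include <stddef.h>

    static void load_seg4(float dst[][4], const float *base,
                          size_t stride, size_t seg0, size_t lanes)
    {
        for (size_t i = 0; i < lanes; i++)
            for (size_t s = 0; s < 4; s++)
                dst[i][s] = base[i * stride + seg0 + s];
    }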

ps_hybrid_analysis_c:       1819.2
ps_hybrid_analysis_rvv_f32: 1037.0 (before)
ps_hybrid_analysis_rvv_f32:  990.0 (after)
2023-11-23 18:57:18 +02:00
Rémi Denis-Courmont b88d4058f9 lavc/g722dsp: optimise R-V V apply_qmf
This stores the constant coefficients deinterleaved, so that they can be
loaded directly with NF=0. Unfortunately, we cannot optimise loading the
input, due to insufficient memory alignment (not 32-bit).
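
A sketch of the init-time deinterleaving (identifiers ours): once the
interleaved taps {e0, o0, e1, o1, ...} are split into two contiguous
banks, each bank can be fetched with a plain unit-stride vector load.

    #include <stddef.h>

    static void deinterleave_taps(int *even, int *odd,
                                  const int *taps, size_t n)
    {
        for (size_t i = 0; i < n; i++) {
            even[i] = taps[2 * i];
            odd[i]  = taps[2 * i + 1];
        }
    }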

Before:
g722_apply_qmf_c:       82.5
g722_apply_qmf_rvv_i32: 78.2

After:
g722_apply_qmf_c:       82.5
g722_apply_qmf_rvv_i32: 65.2
2023-11-23 18:57:18 +02:00
Rémi Denis-Courmont fbc7adba67 lavc/llviddsp: R-V V add_bytes
add_bytes_c:      2077.2
add_bytes_rvv_i32: 105.0
2023-11-18 22:07:14 +02:00
Rémi Denis-Courmont ca664f2254 lavc/flacdsp: R-V V LPC16 function
In this case, the inner loop computing the scalar product can be reduced
to just one multiplication and one sum even with 128-bit vectors. The
result is a lot simpler, but also brings more modest performance gains:

flac_lpc_16_13_c:       15241.0
flac_lpc_16_13_rvv_i32: 11230.0
flac_lpc_16_16_c:       17884.0
flac_lpc_16_16_rvv_i32: 12125.7
flac_lpc_16_29_c:       27847.7
flac_lpc_16_29_rvv_i32: 10494.0
flac_lpc_16_32_c:       30051.5
flac_lpc_16_32_rvv_i32: 10355.0
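
For context, a hedged C reference of the FLAC predictor being vectorised
(argument names and tap order ours); the inner dot product is what the
vector loop reduces to a single widening multiply and sum:

    #include <stdint.h>

    static void lpc16_ref(int32_t *decoded, const int32_t coeffs[32],
                          int pred_order, int qlevel, int len)
    {
        for (int i = pred_order; i < len; i++) {
            int64_t sum = 0;
            for (int j = 0; j < pred_order; j++)
                sum += (int64_t)coeffs[j] * decoded[i - pred_order + j];
            decoded[i] += (int32_t)(sum >> qlevel);
        }
    }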
2023-11-18 22:06:57 +02:00
Rémi Denis-Courmont 295092b46d lavc/flacdsp: R-V V LPC32
The entire set of 32 coefficients and the corresponding past 32 samples
can fit in a single vector register group (with LMUL=8) exactly, but...
since widening doubles the needed vector sizes, we still end up too short
with 128-bit vectors. This adds a very simple version for future 256+-bit
hardware and for pred_order values up to 16, and a somewhat more involved
loop for 128-bit hardware with pred_order between 17 and 32.
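
The size arithmetic, spelled out for VLEN=128:

    /* 32 coefficients x 32 bits = 1024 bits = 8 x 128-bit registers,
     * i.e. exactly one LMUL=8 register group.  Widening the products
     * to 64 bits would need the equivalent of LMUL=16, which does not
     * exist - hence the separate 17..32 path for 128-bit hardware. */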

With 128-bit hardware, the benchmarks look like this:
flac_lpc_32_13_c:       30152.0
flac_lpc_32_13_rvv_i32: 10244.7
flac_lpc_32_16_c:       37314.2
flac_lpc_32_16_rvv_i32: 10126.2
flac_lpc_32_29_c:       61910.0
flac_lpc_32_29_rvv_i32: 14495.2
flac_lpc_32_32_c:       68204.0
flac_lpc_32_32_rvv_i32: 13273.7
2023-11-18 22:05:43 +02:00
Rémi Denis-Courmont 07c303b708 lavc/flacdsp: R-V V decorrelate_indep 16-bit packed
flac_decorrelate_indep2_16_c:        981.7
flac_decorrelate_indep2_16_rvv_i32:  199.2
flac_decorrelate_indep4_16_c:       1749.7
flac_decorrelate_indep4_16_rvv_i32:  401.2
flac_decorrelate_indep6_16_c:       2517.7
flac_decorrelate_indep6_16_rvv_i32:  858.0
flac_decorrelate_indep8_16_c:       3285.7
flac_decorrelate_indep8_16_rvv_i32: 1123.5
2023-11-17 23:59:56 +02:00
Rémi Denis-Courmont fb0295e5fd lavc/flacdsp: R-V V decorrelate_indep 32-bit packed
flac_decorrelate_indep2_32_c:       981.7
flac_decorrelate_indep2_32_rvv_i32: 183.7
flac_decorrelate_indep4_32_c:      1749.7
flac_decorrelate_indep4_32_rvv_i32: 362.5
flac_decorrelate_indep6_32_c:      2517.7
flac_decorrelate_indep6_32_rvv_i32: 715.2
flac_decorrelate_indep8_32_c:      3285.7
flac_decorrelate_indep8_32_rvv_i32: 909.0
2023-11-17 23:59:56 +02:00
Rémi Denis-Courmont 6183a69c0b lavc/flacdsp: R-V V decorrelate_ms packed
flac_decorrelate_ms_16_c:       585.5
flac_decorrelate_ms_16_rvv_i32: 263.0
flac_decorrelate_ms_32_c:       584.7
flac_decorrelate_ms_32_rvv_i32: 250.0
2023-11-17 23:59:23 +02:00
Rémi Denis-Courmont 636ae0e0bc lavc/flacdsp: R-V V packed decorrelate_{l,r}s
flac_decorrelate_ls_16_c:       457.2
flac_decorrelate_ls_16_rvv_i32: 203.0
flac_decorrelate_ls_32_c:       457.2
flac_decorrelate_ls_32_rvv_i32: 203.5
flac_decorrelate_rs_16_c:       456.2
flac_decorrelate_rs_16_rvv_i32: 207.0
flac_decorrelate_rs_32_c:       456.2
flac_decorrelate_rs_32_rvv_i32: 210.5
2023-11-17 23:59:22 +02:00
Rémi Denis-Courmont d076517056 lavc/llauddsp: R-V V scalarproduct_and_madd_int32
scalarproduct_and_madd_int32_c:      10899.7
scalarproduct_and_madd_int32_rvv_i32: 1749.0
2023-11-16 16:53:44 +02:00
Rémi Denis-Courmont 45d0eb3f70 lavc/llauddsp: R-V V scalarproduct_and_madd_int16
scalarproduct_and_madd_int16_c:      10355.7
scalarproduct_and_madd_int16_rvv_i32: 1480.0
2023-11-16 16:53:44 +02:00
Rémi Denis-Courmont 90a779bed6 lavc/huffyuvdsp: basic R-V V add_hfyu_left_pred_bgr32
Better performance can probably be achieved with a more intricate
unrolled loop, but this is a start:

add_hfyu_left_pred_bgr32_c: 15084.0
add_hfyu_left_pred_bgr32_rvv_i32: 10280.2

This would actually be cleaner with the RISC-V P extension, but that is
not ratified yet (I think?) and usually not supported if V is supported.
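
A hedged scalar reference of the operation (names ours): each byte of a
packed 32-bit pixel accumulates onto the matching byte of the pixel to
its left, with 8-bit wraparound. The serial dependency between pixels is
what makes this hard to vectorise well.

    #include <stdint.h>

    static void left_pred_bgr32_ref(uint8_t *dst, const uint8_t *src,
                                    int w, uint8_t left[4])
    {
        for (int i = 0; i < w; i++)
            for (int c = 0; c < 4; c++) {
                left[c] += src[4 * i + c];
                dst[4 * i + c] = left[c];
            }
    }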
2023-11-15 16:51:07 +02:00
Rémi Denis-Courmont c536e92207 lavc/sbrdsp: R-V V hf_apply_noise functions
This is restricted to 128-bit vectors as larger vector sizes could read
past the end of the noise array. Support for future hardware with larger
vector sizes is left for some other time.

hf_apply_noise_0_c:       2319.7
hf_apply_noise_0_rvv_f32: 1229.0
hf_apply_noise_1_c:       2539.0
hf_apply_noise_1_rvv_f32: 1244.7
hf_apply_noise_2_c:       2319.7
hf_apply_noise_2_rvv_f32: 1232.7
hf_apply_noise_3_c:       2541.2
hf_apply_noise_3_rvv_f32: 1244.2
2023-11-13 18:34:29 +02:00
Rémi Denis-Courmont 5b33104fca lavc/sbrdsp: R-V V hf_gen
hf_gen_c:      2922.7
hf_gen_rvv_f32: 731.5
2023-11-13 18:33:02 +02:00
Rémi Denis-Courmont cd7b352c53 lavc/sbrdsp: R-V V autocorrelate
With 5 accumulator vectors and 6 inputs, this can only use LMUL=2.
Also the number of vector loop iterations is small, just 5 on 128-bit
vector hardware.

The vector loop is somewhat unusual in that it processes data in
descending memory order, in order to save on vector slides:
in descending order, we can extract elements to carry over to the next
iteration from the bottom of the vectors directly. With ascending order
(see the Opus postfilter function), there is no way to get the top
elements directly. On the downside, this requires the use of separate
shift and sub (the would-be SH3SUB instruction does not exist), with
a small pipeline stall on the vector load address.

The edge cases are done in scalar code, as this saves on loads
and remains significantly faster than C.

autocorrelate_c: 669.2
autocorrelate_rvv_f32: 421.0
2023-11-12 14:03:09 +02:00
Rémi Denis-Courmont f576a0835b lavc/aacpsdsp: rework R-V V hybrid_synthesis_deint
Given the size of the data set, strided memory accesses cannot be avoided.
We can still do better than the current code.

ps_hybrid_synthesis_deint_c:       12065.5
ps_hybrid_synthesis_deint_rvv_i32: 13650.2 (before)
ps_hybrid_synthesis_deint_rvv_i64:  8181.0 (after)
2023-11-12 14:03:09 +02:00