ldconfig segfault on RZ/Five was Re: Preparing isar-cip-core for RZ/Five


Pavel Machek
 

Hi!

Hmm, seems the issue persists:
:-(. Do you get gcc faulting, too?
I tried, but installation fails - illegal instruction.
Yeah, ldconfig is needed for installation. But I get a segfaulting gcc
binary.

root@demo:~# ldconfig

[ 297.146728] ldconfig[497]: unhandled signal 4 code 0x1 at 0x00000000000380c8 in ldconfig[10000+83000]
...
(gdb) disassemble $pc,+0x10
Dump of assembler code from 0x380c8 to 0x380d8:
=> 0x00000000000380c8: auipc a2,0x66
0x00000000000380cc: addi a2,a2,2000 # 0x9e898
0x00000000000380d0: sd a0,0(a2)
auipc is rather simple: a2 = pc + (0x66 << 12). Not
sure how it could fault. Plus we get "illegal instruction", suggesting
it is not some other fault.
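
To make the auipc arithmetic concrete: auipc adds its 20-bit immediate, shifted left by 12, to the pc, and the addi that follows supplies the low 12 bits. A minimal Python sketch (the function name is made up for illustration) that reproduces the target address gdb prints as a comment in the dump:

```python
def auipc_addi_target(pc, imm20, addi_imm):
    """Address formed by 'auipc rd,imm20' at pc followed by 'addi rd,rd,addi_imm'."""
    upper = (imm20 & 0xFFFFF) << 12
    if upper & 0x80000000:          # the shifted immediate is sign-extended on RV64
        upper -= 1 << 32
    return (pc + upper + addi_imm) & 0xFFFFFFFFFFFFFFFF

# auipc a2,0x66 at 0x380c8, then addi a2,a2,2000
print(hex(auipc_addi_target(0x380C8, 0x66, 2000)))  # 0x9e898
```

The same helper also matches the other auipc pairs quoted in this thread, so the disassembly itself looks self-consistent.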

Could some kind of self-modifying code be involved? I guess some kind
of debugging/watchpoint is not probable.
No idea - but why should ldconfig be self-modifying?
No idea.

But I do have slightly different results than you (I think; I'm far
from a RISC-V expert). I did a breakpoint:

Breakpoint 1, 0x00000000000385d4 in ?? ()
(gdb)

Dump of assembler code from 0x385d4 to 0x385f4:
=> 0x00000000000385d4: lb zero,81(t1)
0x00000000000385d8: andi a1,a1,25
0x00000000000385da: sd zero,24(sp)
0x00000000000385dc: sd zero,32(sp)

If I do the stepi, it will give the illegal instruction, because,
well, we are in the middle of the auipc instruction:

(gdb) disassemble $pc-0x10,+0x20
Dump of assembler code from 0x385c4 to 0x385e4:
0x00000000000385c4: .4byte 0x4881f753
0x00000000000385c8: li a6,0
0x00000000000385ca: li a5,0
0x00000000000385cc: addi a3,a1,920
0x00000000000385d0: mv a2,s8
0x00000000000385d2: auipc a0,0x3f
0x00000000000385d6: addi a0,a0,-1890 # 0x76e70
0x00000000000385da: sd zero,24(sp)
0x00000000000385dc: sd zero,32(sp)
0x00000000000385de: sb t3,20(sp)
0x00000000000385e2: sd s7,40(sp)
End of assembler dump.
(gdb)

Weird. But it explains the SIGILL: executing a complete auipc should
not fault, whereas starting in the middle of one can...
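
The "middle of the auipc" reading can be checked by hand. The sketch below assumes that both `auipc a0,0x3f` and `addi a0,a0,-1890` use their standard 32-bit encodings (consistent with the 4-byte spacing in the dump); the helper names are illustrative. It lays the two instructions out in little-endian memory and decodes the 4 bytes that start 2 bytes into the auipc, which is where 0x385d4 sits relative to 0x385d2:

```python
A0, T1 = 10, 6  # ABI register numbers: a0 = x10, t1 = x6

def encode_auipc(rd, imm20):
    """U-type: imm[31:12] | rd | opcode 0x17."""
    return ((imm20 & 0xFFFFF) << 12) | (rd << 7) | 0x17

def encode_addi(rd, rs1, imm12):
    """I-type: imm[11:0] | rs1 | funct3=0 | rd | opcode 0x13."""
    return ((imm12 & 0xFFF) << 20) | (rs1 << 15) | (rd << 7) | 0x13

def decode_i_type(word):
    """Split a 32-bit word into I-type fields, sign-extending the immediate."""
    imm = word >> 20
    if imm & 0x800:
        imm -= 0x1000
    return {
        "opcode": word & 0x7F,
        "rd": (word >> 7) & 0x1F,
        "funct3": (word >> 12) & 0x7,
        "rs1": (word >> 15) & 0x1F,
        "imm": imm,
    }

# auipc at 0x385d2 followed by addi at 0x385d6, as bytes in memory
code = (encode_auipc(A0, 0x3F).to_bytes(4, "little")
        + encode_addi(A0, A0, -1890).to_bytes(4, "little"))

# Fetching from 0x385d4 picks up the upper half of auipc and the lower half of addi
straddle = int.from_bytes(code[2:6], "little")
fields = decode_i_type(straddle)
print(hex(straddle), fields)
# opcode 0x03 is LOAD, funct3 0 is LB, rd 0 is zero, rs1 6 is t1, imm 81
print(fields["rs1"] == T1)
```

The decoded fields correspond to `lb zero,81(t1)`, the instruction gdb shows at the breakpoint address, so the first dump is what the same bytes look like when read from a misaligned start.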

Best regards,
Pavel
--
DENX Software Engineering GmbH, Managing Director: Wolfgang Denk
HRB 165235 Munich, Office: Kirchenstr.5, D-82194 Groebenzell, Germany


Pavel Machek
 

Hi!

I tried, but installation fails - illegal instruction.
Yeah, ldconfig is needed for installation. But I get a segfaulting gcc
binary.
It crashes rather soon after startup, so I was able to trace the
complete path.

But I do have slightly different results than you (I think; I'm far
from a RISC-V expert). I did a breakpoint:

Breakpoint 1, 0x00000000000385d4 in ?? ()
I believe it should not end up at 0x00000000000385d4 at all. The
jal instruction at 0x000000000001537e should call 0x3806a AFAICT,
but it ends up at 0x385d4 instead. It happens during
single-stepping, so it should not be anything subtle.

(gdb) disassemble $pc,+0x20
Dump of assembler code from 0x1537c to 0x1539c:
=> 0x000000000001537c: mv a0,a4
0x000000000001537e: jal ra,0x3806a
0x0000000000015382: auipc a5,0x8a
0x0000000000015386: addi a5,a5,1342 # 0x9f8c0
0x000000000001538a: ld a4,0(a5)
0x000000000001538c: beqz a4,0x153f0
0x000000000001538e: jal ra,0x38abe
0x0000000000015392: ld a0,0(s6)
0x0000000000015396: auipc s7,0x85
0x000000000001539a: ld s7,-406(s7) # 0x9a200
End of assembler dump.
(gdb)
(gdb) stepi
0x000000000001537e in ?? ()
(gdb)

Program received signal SIGILL, Illegal instruction.
0x00000000000385d4 in ?? ()
(gdb)

Best regards,
Pavel


Jan Kiszka
 

On 07.10.22 00:32, Pavel Machek wrote:
I believe it should not end at 0x00000000000385d4 at all. The
0x000000000001537e jal instruction should end up calling 0x3806a
AFAICT, but it calls 0x385d4 instead. It happens during
single-stepping, so it should not be anything subtle.

Did you try to compare the call trace to QEMU, where we divert?

Jan

--
Siemens AG, Technology
Competence Center Embedded Linux


Pavel Machek
 

Hi!

Did you try to compare the call trace to QEMU, where we divert?
Yes, that's a possible way forward, but it will require some
considerable setup on my side.

If you have QEMU ready... objdump tells you ldconfig's entry point;
from that point you can just stepi. In fewer than 200 steps, you should
have the SIGILL... and the complete sequence of steps that leads to it.
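
A rough way to automate that comparison, sketched in Python. Two assumptions that are not from this thread: gdb's `starti` is used to stop at the first instruction instead of setting a breakpoint at the entry point by hand, and the function names are hypothetical. The first helper writes a gdb command file that prints and steps one instruction at a time; the second finds where two recorded pc traces stop agreeing:

```python
def gdb_step_script(steps=200):
    """Build a gdb command file that prints each instruction, then single-steps."""
    lines = ["set pagination off", "starti"]
    for _ in range(steps):
        lines += ["x/i $pc", "stepi"]
    return "\n".join(lines) + "\n"

def first_divergence(trace_a, trace_b):
    """Return (index, pc_a, pc_b) for the first differing step, or None if equal."""
    for i, (a, b) in enumerate(zip(trace_a, trace_b)):
        if a != b:
            return i, a, b
    if len(trace_a) != len(trace_b):
        i = min(len(trace_a), len(trace_b))
        longer = trace_a if len(trace_a) > len(trace_b) else trace_b
        return (i, longer[i], None) if longer is trace_a else (i, None, longer[i])
    return None

# pc values taken from the dumps in this thread: the jal at 0x1537e is
# expected to reach 0x3806a but was observed to land at 0x385d4.
expected = [0x1537C, 0x1537E, 0x3806A]
observed = [0x1537C, 0x1537E, 0x385D4]
print(first_divergence(expected, observed))
```

The generated file could be run with something like `gdb -batch -x steps.gdb /sbin/ldconfig` on both QEMU and the board, with the `x/i` output lines parsed into the two traces. Note that gdb stops executing the command file at the first error, for example once the process has died.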

Best regards,
Pavel


Jan Kiszka
 

On 07.10.22 12:19, Pavel Machek wrote:
If you have QEMU ready... objdump tells you ldconfig's entrypoint,
from that point you can just stepi. In less than 200 steps, you should
have sigill... and complete steps that lead to it.
I've updated sid-ports (dropped the snapshot pinning), and now I'm
getting a page fault on the instruction before the one that was causing
SIGILL before:

[ 558.490689] CPU: 0 PID: 3212 Comm: ldconfig Not tainted 5.10.83-cip1-riscv-renesas #1
[ 558.490697] epc: 00000000000380c6 ra : 0000000000015382 sp : 0000003fff9e3c10
[ 558.490703] gp : 0000000000099da8 tp : 0000003fe9c3c800 t0 : 0000003fe9c427c0
[ 558.490710] t1 : 0000003fe9cd059c t2 : 0000002acb8f2c00 s0 : 0000002b079e9510
[ 558.490716] s1 : 0000000000000001 a0 : 0000003fff9e3d18 a1 : 0000000000000001
[ 558.490722] a2 : 0000003fff9e3c88 a3 : 0000000000000000 a4 : 0000003fff9e3d18
[ 558.490728] a5 : 000000000009736e a6 : 0000003fff9e3c80 a7 : 00000000000000dd
[ 558.490734] s2 : 0000003fff9e3c88 s3 : 0000000000000000 s4 : 0000000000000000
[ 558.490740] s5 : 00000000000105a4 s6 : 000000000009e670 s7 : 0000002b079c8ab0
[ 558.490746] s8 : 0000002b079e91c0 s9 : 0000000000000000 s10: 0000002acb8fc9b0
[ 558.490752] s11: 0000002acb8fc920 t3 : 0000002acb80f5d8 t4 : 000000000009259c
[ 558.490758] t5 : 0000000000000004 t6 : 0000002b0799c010
[ 558.490764] status: 0000000200004020 badaddr: 00000000000000e1 cause: 000000000000000f

(gdb) disassemble 0x00000000000380c6,+0x10
Dump of assembler code from 0x380c6 to 0x380d6:
0x00000000000380c6: addi sp,sp,-416
0x00000000000380c8: auipc a2,0x66
0x00000000000380cc: addi a2,a2,2000 # 0x9e898
0x00000000000380d0: sd a0,0(a2)
0x00000000000380d2: mv a5,sp
0x00000000000380d4: addi a4,sp,416
End of assembler dump.
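
For reading the register dump: the `cause` field is the RISC-V scause value, and its exception codes are defined in the privileged specification. A small Python lookup, as a sketch only (the helper name is illustrative and the table covers just the standard synchronous exceptions):

```python
EXCEPTIONS = {
    0: "Instruction address misaligned",
    1: "Instruction access fault",
    2: "Illegal instruction",
    3: "Breakpoint",
    4: "Load address misaligned",
    5: "Load access fault",
    6: "Store/AMO address misaligned",
    7: "Store/AMO access fault",
    8: "Environment call from U-mode",
    9: "Environment call from S-mode",
    12: "Instruction page fault",
    13: "Load page fault",
    15: "Store/AMO page fault",
}

def decode_scause(value, xlen=64):
    """Name the trap described by an scause value (top bit set means interrupt)."""
    if value >> (xlen - 1):
        return f"interrupt {value & ((1 << (xlen - 1)) - 1)}"
    return EXCEPTIONS.get(value, f"reserved or custom ({value})")

print(decode_scause(0xF))  # Store/AMO page fault
```

So the kernel reports a store page fault at badaddr 0xe1, which is worth keeping in mind given that the instruction gdb lists at epc is an addi on sp rather than a store.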

I've stepped this through under qemu as well, and the control flow is
identical. Registers are almost the same, except for some temporaries:

--- regs-qemu
+++ regs-rzfive
@@ -2,9 +2,9 @@
ra 0x15382 0x15382
sp 0x3ffffffbe0 0x3ffffffbe0
gp 0x99da8 0x99da8
-tp 0x3ff7e77800 0x3ff7e77800
-t0 0x3ff7e7d7c0 274742106048
-t1 0x3ff7f0b59c 274742687132
+tp 0x3ff7e78800 0x3ff7e78800
+t0 0x3ff7e7e7c0 274742110144
+t1 0x3ff7f0c59c 274742691228
t2 0x2aaab92c00 183252888576
fp 0x2aaabaee00 0x2aaabaee00
s1 0x1 1

No idea if that is normal (different machines, different memory sizes
and layouts) or a symptom of the problem.
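
One way to look at the differing registers is to compute the deltas rather than eyeball them. A minimal Python sketch, assuming the dumps are in the `name hex-value natural-value` layout shown in the diff (the function names and the `first`/`second` labels for the `-` and `+` sides are illustrative):

```python
def parse_regs(text):
    """Parse lines like 'tp 0x3ff7e77800 0x3ff7e77800' into {name: int}."""
    regs = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) >= 2 and parts[1].startswith("0x"):
            regs[parts[0]] = int(parts[1], 16)
    return regs

def reg_deltas(a, b):
    """Return {name: b - a} for registers present in both dumps that differ."""
    return {name: b[name] - a[name] for name in a if name in b and a[name] != b[name]}

first = parse_regs("""\
ra 0x15382 0x15382
tp 0x3ff7e77800 0x3ff7e77800
t0 0x3ff7e7d7c0 274742106048
t1 0x3ff7f0b59c 274742687132
""")
second = parse_regs("""\
ra 0x15382 0x15382
tp 0x3ff7e78800 0x3ff7e78800
t0 0x3ff7e7e7c0 274742110144
t1 0x3ff7f0c59c 274742691228
""")

for name, delta in reg_deltas(first, second).items():
    print(name, hex(delta))
```

All three differing registers are off by exactly 0x1000, one 4 KiB page. That would be consistent with a mapping simply being placed one page apart on the two systems, although it does not prove the difference is harmless.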

Jan



Jan Kiszka
 

On 08.10.22 10:27, Jan Kiszka wrote:
No idea if that is normal (different machines, different memory sizes
and layouts) or a symptom of the problem.
...
OpenEmbedded nodistro.0 smarc-rzfive ttySC0


[ 12.829622] audit: type=1006 audit(1653987107.735:2): pid=156 uid=0 old-auid=4294967295 auid=0 tty=(none) old-ses=4294967295 ses=1 res=1
root@smarc-rzfive:~# ldconfig
[ 22.278868] ldconfig[166]: unhandled signal 11 code 0x1 at 0x0000000000000088 in ldconfig[10000+68000]
[ 22.290244] CPU: 0 PID: 166 Comm: ldconfig Not tainted 5.10.83-cip1-riscv-renesas #1
[ 22.298954] epc: 0000000000030eea ra : 00000000000145a0 sp : 0000003fff9f8aa0
[ 22.306906] gp : 000000000007fe48 tp : 0000003fd958b720 t0 : 0000000000000000
[ 22.314973] t1 : 0000002adf9c3bbc t2 : 00000000000003ff s0 : 0000003fff9f8c90
[ 22.322986] s1 : 0000000000014b0e a0 : 0000003fff9f8c98 a1 : 0000000000000000
[ 22.330967] a2 : 0000003fff9f8be8 a3 : 0000000000014a86 a4 : 000000000007e576
[ 22.338936] a5 : 0000000000000000 a6 : 0000003fff9f8be0 a7 : 0000000000000000
[ 22.346897] s2 : 0000000000000000 s3 : 0000003fd96df918 s4 : ffffffffffffffff
[ 22.354905] s5 : 0000002b01953f70 s6 : 0000002b01953c60 s7 : 0000002b019539b0
[ 22.362875] s8 : 0000002b01953b50 s9 : 0000000000000000 s10: 0000002adfa74584
[ 22.370884] s11: 0000000000000000 t3 : 0000003fd960ee18 t4 : 000000000000000f
[ 22.378945] t5 : 000000000000000f t6 : 0000000000000000
[ 22.385051] status: 8000000200004020 badaddr: 0000000000000088 cause: 000000000000000d
[ 22.393860] audit: type=1701 audit(1653987117.299:3): auid=4294967295 uid=0 gid=0 ses=4294967295 pid=166 comm="ldconfig" exe="/sbin/ldconfig" sig=11 res=1
Segmentation fault


That was the version I found on eMMC.

I think you have some real homework now...

Jan



Biju Das
 

On 08.10.22 10:27, Jan Kiszka wrote:
I think you have some real homework now...
What is your conclusion? Is it a toolchain-related issue, a cache-related
issue, or something else?

Cheers,
Biju


Chris Paterson
 

Hi Jan,

From: Jan Kiszka <jan.kiszka@...>
Sent: 09 October 2022 09:29

I think you have some real homework now...
Thanks, we'll take a look.

Kind regards, Chris




Jan Kiszka
 

On 09.10.22 10:42, Biju Das wrote:
Subject: Re: [cip-dev] ldconfig segfault on RZ/Five was Re: Preparing
isar-cip-core for RZ/Five

On 08.10.22 10:27, Jan Kiszka wrote:
On 07.10.22 12:19, Pavel Machek wrote:
Hi!

I tried, but installation fails - illegal instruction.
Yeah, ldconfig is needed for installation. But I get a
segfaulting
gcc binary.
It crashes rather soon after startup, so I was able to trace
complete path.

But I do have slightly different results then you (I think; I'm
far from risc-v expert). I did a breakpoint:

Breakpoint 1, 0x00000000000385d4 in ?? ()
I believe it should not end at 0x00000000000385d4 at all. The
0x000000000001537e jal instruction should end up calling 0x3806a
AFAICT, but it calls 0x385d4 instead. It happens during
single-stepping, so it should not be anything subtle.

(gdb) disassemble $pc,+0x20
Dump of assembler code from 0x1537c to 0x1539c:
=> 0x000000000001537c: mv a0,a4
0x000000000001537e: jal ra,0x3806a
0x0000000000015382: auipc a5,0x8a
0x0000000000015386: addi a5,a5,1342 # 0x9f8c0
0x000000000001538a: ld a4,0(a5)
0x000000000001538c: beqz a4,0x153f0
0x000000000001538e: jal ra,0x38abe
0x0000000000015392: ld a0,0(s6)
0x0000000000015396: auipc s7,0x85
0x000000000001539a: ld s7,-406(s7) # 0x9a200
End of assembler dump.
(gdb)
(gdb) stepi
0x000000000001537e in ?? ()
(gdb)

Program received signal SIGILL, Illegal instruction.
0x00000000000385d4 in ?? ()
(gdb)
Did you try to compare the call trace to QEMU, where we divert?
Yes, that's possible way forward, but it will require some
considerable setup on my side.

If you have QEMU ready... objdump tells you ldconfig's entrypoint,
from that point you can just stepi. In less than 200 steps, you
should have sigill... and complete steps that lead to it.
I've updated sid-ports (dropped the snapshot pinning), and now I'm
getting a page fault on the instruction before the one that was
causing SIGILL before:

[ 558.490689] CPU: 0 PID: 3212 Comm: ldconfig Not tainted
5.10.83-cip1-riscv-renesas #1 [ 558.490697] epc: 00000000000380c6
ra
: 0000000000015382 sp : 0000003fff9e3c10 [ 558.490703] gp :
0000000000099da8 tp : 0000003fe9c3c800 t0 : 0000003fe9c427c0 [
558.490710] t1 : 0000003fe9cd059c t2 : 0000002acb8f2c00 s0 :
0000002b079e9510 [ 558.490716] s1 : 0000000000000001 a0 :
0000003fff9e3d18 a1 : 0000000000000001 [ 558.490722] a2 :
0000003fff9e3c88 a3 : 0000000000000000 a4 : 0000003fff9e3d18 [
558.490728] a5 : 000000000009736e a6 : 0000003fff9e3c80 a7 :
00000000000000dd [ 558.490734] s2 : 0000003fff9e3c88 s3 :
0000000000000000 s4 : 0000000000000000 [ 558.490740] s5 :
00000000000105a4 s6 : 000000000009e670 s7 : 0000002b079c8ab0 [
558.490746] s8 : 0000002b079e91c0 s9 : 0000000000000000 s10:
0000002acb8fc9b0 [ 558.490752] s11: 0000002acb8fc920 t3 :
0000002acb80f5d8 t4 : 000000000009259c [ 558.490758] t5 :
0000000000000004 t6 : 0000002b0799c010 [ 558.490764] status:
0000000200004020 badaddr: 00000000000000e1 cause: 000000000000000f

(gdb) disassemble 0x00000000000380c6,+0x10 Dump of assembler code
from
0x380c6 to 0x380d6:
0x00000000000380c6: addi sp,sp,-416
0x00000000000380c8: auipc a2,0x66
0x00000000000380cc: addi a2,a2,2000 # 0x9e898
0x00000000000380d0: sd a0,0(a2)
0x00000000000380d2: mv a5,sp
0x00000000000380d4: addi a4,sp,416
End of assembler dump.

I've stepped this through under qemu as well, and the control flow
is
identical. Registers are almost the same, except for some
temporaries:

--- regs-qemu
+++ regs-rzfive
@@ -2,9 +2,9 @@
ra 0x15382 0x15382
sp 0x3ffffffbe0 0x3ffffffbe0
gp 0x99da8 0x99da8
-tp 0x3ff7e77800 0x3ff7e77800
-t0 0x3ff7e7d7c0 274742106048
-t1 0x3ff7f0b59c 274742687132
+tp 0x3ff7e78800 0x3ff7e78800
+t0 0x3ff7e7e7c0 274742110144
+t1 0x3ff7f0c59c 274742691228
t2 0x2aaab92c00 183252888576
fp 0x2aaabaee00 0x2aaabaee00
s1 0x1 1

No idea if that is normal (different machines, different memory
sizes
and layouts) or a symptom of the problem.
...
OpenEmbedded nodistro.0 smarc-rzfive ttySC0


[ 12.829622] audit: type=1006 audit(1653987107.735:2): pid=156 uid=0 old-auid=4294967295 auid=0 tty=(none) old-ses=4294967295 ses=1 res=1
root@smarc-rzfive:~# ldconfig
[ 22.278868] ldconfig[166]: unhandled signal 11 code 0x1 at 0x0000000000000088 in ldconfig[10000+68000]
[ 22.290244] CPU: 0 PID: 166 Comm: ldconfig Not tainted 5.10.83-cip1-riscv-renesas #1
[ 22.298954] epc: 0000000000030eea ra : 00000000000145a0 sp : 0000003fff9f8aa0
[ 22.306906] gp : 000000000007fe48 tp : 0000003fd958b720 t0 : 0000000000000000
[ 22.314973] t1 : 0000002adf9c3bbc t2 : 00000000000003ff s0 : 0000003fff9f8c90
[ 22.322986] s1 : 0000000000014b0e a0 : 0000003fff9f8c98 a1 : 0000000000000000
[ 22.330967] a2 : 0000003fff9f8be8 a3 : 0000000000014a86 a4 : 000000000007e576
[ 22.338936] a5 : 0000000000000000 a6 : 0000003fff9f8be0 a7 : 0000000000000000
[ 22.346897] s2 : 0000000000000000 s3 : 0000003fd96df918 s4 : ffffffffffffffff
[ 22.354905] s5 : 0000002b01953f70 s6 : 0000002b01953c60 s7 : 0000002b019539b0
[ 22.362875] s8 : 0000002b01953b50 s9 : 0000000000000000 s10: 0000002adfa74584
[ 22.370884] s11: 0000000000000000 t3 : 0000003fd960ee18 t4 : 000000000000000f
[ 22.378945] t5 : 000000000000000f t6 : 0000000000000000
[ 22.385051] status: 8000000200004020 badaddr: 0000000000000088 cause: 000000000000000d
[ 22.393860] audit: type=1701 audit(1653987117.299:3): auid=4294967295 uid=0 gid=0 ses=4294967295 pid=166 comm="ldconfig" exe="/sbin/ldconfig" sig=11 res=1
Segmentation fault
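
For readers not used to these dumps, the "cause:" field is the RISC-V scause value. The sketch below decodes it using the exception codes from the RISC-V privileged specification; the function name describe_cause is made up for illustration.

```python
# Synchronous exception codes from the RISC-V privileged specification.
EXCEPTIONS = {
    0: "instruction address misaligned",
    1: "instruction access fault",
    2: "illegal instruction",
    3: "breakpoint",
    4: "load address misaligned",
    5: "load access fault",
    6: "store/AMO address misaligned",
    7: "store/AMO access fault",
    8: "environment call from U-mode",
    9: "environment call from S-mode",
    12: "instruction page fault",
    13: "load page fault",
    15: "store/AMO page fault",
}


def describe_cause(field: str, xlen: int = 64) -> str:
    """Decode the hex 'cause:' field of a kernel register dump."""
    value = int(field, 16)
    # The most significant bit of scause marks an interrupt rather than an exception.
    if value >> (xlen - 1):
        return "interrupt %d" % (value & ((1 << (xlen - 1)) - 1))
    return EXCEPTIONS.get(value, "exception code %d" % value)


print(describe_cause("000000000000000d"))  # the OpenEmbedded dump above
print(describe_cause("000000000000000f"))  # the earlier dump with epc 0x380c6
```

The earlier dump reports a store/AMO page fault with epc pointing at "addi sp,sp,-416", an instruction that does not access memory at all, which again suggests that the core is not executing the bytes gdb shows.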


That was the version I found on eMMC.

I think you have some real homework now...
What is your conclusion? Is it a toolchain-related issue, a cache-related issue, or something else?
I have no idea and still only limited knowledge about the arch and this
SoC. We can just rule out by now that the issue is Debian-exclusive.

Jan

--
Siemens AG, Technology
Competence Center Embedded Linux


Biju Das
 

I have no idea and still only limited knowledge about the arch and
this SoC. We can just rule out by now that the issue is Debian-exclusive.
Thanks for your feedback.

Cheers,
Biju


Florian Bezdeka
 

On 11.10.22 12:34, Biju Das via lists.cip-project.org wrote:
I've updated sid-ports (dropped the snapshot pinning), and now I'm
getting a page fault on the instruction before the one that was
causing SIGILL before:
In case the requested page is a page with PROT_WRITE only (no PROT_READ)
it might be related to
https://lore.kernel.org/linux-riscv/20220915193702.2201018-1-abrestic@rivosinc.com/

AFAIR all stable branches have that problem currently.



Jan Kiszka
 

On 11.10.22 20:51, Florian Bezdeka wrote:
In case the requested page is a page with PROT_WRITE only (no PROT_READ)
it might be related to
https://lore.kernel.org/linux-riscv/20220915193702.2201018-1-abrestic@rivosinc.com/

AFAIR all stable branches have that problem currently.
Nice idea. I quickly hacked that on top of the rzfive kernel, but it
didn't change the picture, unfortunately.

That said, being able to test linus/master would be very valuable here.

Jan

--
Siemens AG, Technology
Competence Center Embedded Linux


Lad Prabhakar
 

Hi Jan,

-----Original Message-----
From: Jan Kiszka <jan.kiszka@...>
Sent: 11 October 2022 21:15
To: Florian Bezdeka <florian.bezdeka@...>; cip-dev@...; Chris Paterson
<Chris.Paterson2@...>; Prabhakar Mahadev Lad <prabhakar.mahadev-lad.rj@...>; Hung
Tran <hung.tran.jy@...>
Cc: Pavel Machek <pavel@...>
Subject: Re: [cip-dev] ldconfig segfault on RZ/Five was Re: Preparing isar-cip-core for RZ/Five

Nice idea. I quickly hacked that on top of the rzfive kernel, but it didn't change the picture,
unfortunately.
Thanks for the quick test.

That said, being able to test linus/master would be very valuable here.
I will test this on top of v6.0 and update the results.

Cheers,
Prabhakar


Lad Prabhakar
 

Hi Jan,

-----Original Message-----
From: Prabhakar Mahadev Lad
Sent: 11 October 2022 21:49
To: Jan Kiszka <jan.kiszka@...>; Florian Bezdeka
<florian.bezdeka@...>; cip-dev@...; Chris
Paterson <Chris.Paterson2@...>; Hung Tran
<hung.tran.jy@...>
Cc: Pavel Machek <pavel@...>
Subject: RE: [cip-dev] ldconfig segfault on RZ/Five was Re: Preparing
isar-cip-core for RZ/Five

Nice idea. I quickly hacked that on top of the rzfive kernel, but it
didn't change the picture, unfortunately.
Thanks for the quick test.

That said, being able to test linus/master would be very valuable
here.
I will test this on top of v6.0 and update the results.
I did a quick test with the patches pointed out by Florian, but unfortunately ldconfig still fails.

Cheers,
Prabhakar


Ulrich Hecht
 

On 10/12/2022 11:50 AM CEST Lad Prabhakar <prabhakar.mahadev-lad.rj@...> wrote:
I did a quick test with the patches pointed by Florian but unfortunately ldconfig still fails.
I did some experiments on RZ/Five with this issue, and I'm almost positive that something is wrong (or doesn't work as documented) with the icache handling on this SoC.

1. The issue only affects non-PIE executables (there are very few of those, basically just ldconfig, gcc, cpp and gcov* on the Debian system), and it occurs very early during the execution of the program. According to the datasheet, the cache on the ax45mp-1c core is virtually indexed, so it is unlikely that a PIE executable will ever hit anything in the cache when newly loaded, but it is much more likely with non-PIE executables.

2. Setting a breakpoint before the illegal/segfaulting instruction doesn't work, and what is executed is clearly not what we're seeing through the dcache (the offending instructions are neither illegal, nor are they able to cause segfaults), so instruction fetches must see something different.

3. Neither manually calling __vdso_flush_icache() from gdb (which executes a "fence.i" instruction) nor patching a "fence.i" into the ldconfig binary seems to have any effect. According to the ax45mp-1c datasheet, "fence.i" should flush the dcache and invalidate the icache.

My educated guess is that, in spite of the claims in the core manual, the "fence.i" instruction is not implemented, or not implemented correctly. (The datasheet does acknowledge that "fence", without the ".i", is a nop.)

The RISC-V ISA manual says that "fence.i" is part of the optional "Zifencei" extension, which I don't see mentioned in the core datasheet anywhere. (And at least at first glance, I couldn't find any other mechanism to invalidate the icache there either.)
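
To check which binaries fall into the non-PIE group from point 1, one can look at the e_type field of the ELF header. The following is a rough sketch, not a replacement for readelf; the helper name elf_kind is made up, and the demo uses a synthetic header so that it runs anywhere.

```python
import struct

ET_EXEC, ET_DYN = 2, 3  # e_type values from the ELF specification


def elf_kind(header: bytes) -> str:
    """Classify an ELF file from its first 18 bytes."""
    if header[:4] != b"\x7fELF":
        raise ValueError("not an ELF file")
    # EI_DATA (offset 5): 1 = little endian, 2 = big endian.
    endian = "<" if header[5] == 1 else ">"
    # e_type is a 16-bit field at offset 16 in both ELF32 and ELF64.
    (e_type,) = struct.unpack_from(endian + "H", header, 16)
    if e_type == ET_EXEC:
        return "non-PIE executable (fixed load address)"
    if e_type == ET_DYN:
        return "PIE executable or shared object"
    return "other (e_type=%d)" % e_type


# Synthetic 64-bit little-endian header with e_type = ET_EXEC.
demo = b"\x7fELF" + bytes([2, 1, 1, 0]) + bytes(8) + struct.pack("<H", ET_EXEC)
print(elf_kind(demo))
```

On a real system the header would come from something like open("/sbin/ldconfig", "rb").read(18).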

CU
Uli


Pavel Machek
 

Hi!

(Can I get you to wrap emails at ~72 columns or so?)

I did a quick test with the patches pointed by Florian but unfortunately ldconfig still fails.
I did some experiments on RZ/Five with this issue, and I'm almost positive that there is something wrong (or doesn't work as documented) with the icache handling on this SoC.
1. The issue only affects non-PIE executables (there are very few
of those, basically just ldconfig, gcc, cpp and gcov* on the Debian
system), and it occurs very early during the execution of the
program. According to the datasheet, the cache on the ax45mp-1c core
is virtually indexed, so it is unlikely that a PIE executable will
ever hit anything in the cache when newly loaded, but it is much
more likely with non-PIE executables.
Ah, I was wondering what gcc and ldconfig have in common...

2. Setting a breakpoint before the illegal/segfaulting instruction
doesn't work, and what is executed is clearly not what we're seeing
through the dcache (the offending instructions are neither illegal,
nor are they able to cause segfaults), so instruction fetches must
see something different.
In my testing, I was able to stepi from the start, and then I was able
to put a breakpoint at the preceding instruction (which was a jump). It
looked like we jumped into the middle of an instruction, which would
explain the fault.

Best regards,
Pavel

--
DENX Software Engineering GmbH, Managing Director: Wolfgang Denk
HRB 165235 Munich, Office: Kirchenstr.5, D-82194 Groebenzell, Germany


Pavel Machek
 

Hi!

1. The issue only affects non-PIE executables (there are very few of those, basically just ldconfig, gcc, cpp and gcov* on the Debian system), and it occurs very early during the execution of the program. According to the datasheet, the cache on the ax45mp-1c core is virtually indexed, so it is unlikely that a PIE executable will ever hit anything in the cache when newly loaded, but it is much more likely with non-PIE executables.
This is a very good observation. Thanks!

And indeed it looks like _any_ non-PIE executable fails. See:

root@smarc-rzfive:/my# cat mytest.c
#include <stdio.h>

void main(void) { printf("ahoj svete\n"); }
root@smarc-rzfive:/my# clang mytest.c -fno-pie -static
mytest.c:3:1: warning: return type of 'main' is not 'int' [-Wmain-return-type]
void main(void) { printf("ahoj svete\n"); }
^
mytest.c:3:1: note: change return type to 'int'
void main(void) { printf("ahoj svete\n"); }
^~~~
int
1 warning generated.
root@smarc-rzfive:/my# ./a.out
[ 279.010424] a.out[214]: unhandled signal 11 code 0x1 at 0xffffff8c38bd1524


(It might be useful to add -O3 -g to the clang command line.)

Then you can

b _dl_discover_osversion
run

(gdb) disassemble /r
Dump of assembler code for function _dl_discover_osversion:
0x000000000002538a <+0>: 41 71 addi sp,sp,-496
0x000000000002538c <+2>: a8 00 addi a0,sp,72
0x000000000002538e <+4>: 86 f7 sd ra,488(sp)
0x0000000000025390 <+6>: a2 f3 sd s0,480(sp)
0x0000000000025392 <+8>: a6 ef sd s1,472(sp)
0x0000000000025394 <+10>: ca eb sd s2,464(sp)
=> 0x0000000000025396 <+12>: ef 60 a1 5c jal ra,0x3b960 <uname>
0x000000000002539a <+16>: 93 05 a1 0c addi a1,sp,202
0x000000000002539e <+20>: 49 e5 bnez a0,0x25428 <_dl_discover_osversion+158>
0x00000000000253a0 <+22>: 81 48 li a7,0
0x00000000000253a2 <+24>: 01 45 li a0,0
0x00000000000253a4 <+26>: 25 48 li a6,9
0x00000000000253a6 <+28>: 13 03 e0 02 li t1,46

It clearly tries to call uname, which it should, according to the
source code. But somehow it ends up in a completely different function:

(gdb) stepi

Program received signal SIGILL, Illegal instruction.
0x000000000003b2fe in wcsrtombs ()

Best regards,
Pavel
--
DENX Software Engineering GmbH, Managing Director: Wolfgang Denk
HRB 165235 Munich, Office: Kirchenstr.5, D-82194 Groebenzell, Germany


Lad Prabhakar
 

Hi All,

-----Original Message-----
From: Pavel Machek <pavel@...>
Sent: 13 October 2022 22:48
To: Ulrich Hecht <uli@...>
Cc: cip-dev@...; Prabhakar Mahadev Lad <prabhakar.mahadev-lad.rj@...>;
Jan Kiszka <jan.kiszka@...>; Florian Bezdeka <florian.bezdeka@...>; Chris Paterson
<Chris.Paterson2@...>; Hung Tran <hung.tran.jy@...>; Pavel Machek <pavel@...>
Subject: Re: [cip-dev] ldconfig segfault on RZ/Five was Re: Preparing isar-cip-core for RZ/Five

A brief summary of the issue and the solution:

TEXT_START_ADDR is the start address of an application's text segment. For RISC-V platforms it is set to 0x10000.

So when an application is compiled with the static flag, it is loaded from 0x10000 up to an end address that depends on the size of the application.

Entry point 0x101c0
There are 5 program headers, starting at offset 64

Program Headers:
Type Offset VirtAddr PhysAddr
FileSiz MemSiz Flags Align
LOAD 0x0000000000000000 0x0000000000010000 0x0000000000010000
0x0000000000059b48 0x0000000000059b48 R E 0x1000
LOAD 0x0000000000059b60 0x000000000006ab60 0x000000000006ab60
0x0000000000001f68 0x0000000000003528 RW 0x1000
So for the above statically compiled application, the entry point is 0x101c0 and the first LOAD segment starts at 0x0000000000010000.

Andes cores have local memories, ILM and DLM, which are mapped in the region H'0_0003_0000 - H'0_0004_FFFF on the RZ/Five SoC. When a virtual address falls in this range, the MMU doesn't trigger a page fault and instead treats the virtual address as a physical address, so the application fails to run (it crashes somewhere).

So to avoid this issue we set TEXT_START_ADDR to 0x50000, so that the virtual addresses of a statically compiled application don't fall in the range H'0_0003_0000 - H'0_0004_FFFF.

Elf file type is EXEC (Executable file)
Entry point 0x504e4
There are 5 program headers, starting at offset 64

Program Headers:
  Type           Offset             VirtAddr           PhysAddr
                 FileSiz            MemSiz              Flags  Align
  LOAD           0x0000000000000000 0x0000000000050000 0x0000000000050000
                 0x0000000000057dc8 0x0000000000057dc8  R E    0x1000
  LOAD           0x00000000000585b8 0x00000000000a95b8 0x00000000000a95b8
                 0x0000000000004ee0 0x00000000000064b0  RW     0x1000
  NOTE           0x0000000000000158 0x0000000000050158 0x0000000000050158
                 0x0000000000000044 0x0000000000000044  R      0x4

So now, with the fix, we can see that the statically compiled application is offset: the entry point is 0x504e4 and the load address is 0x0000000000050000. With this we can be sure the MMU will always trigger a page fault.
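
As a rough illustration (not part of the attached patch), the overlap with the ILM/DLM window can be checked by hand from the two readelf listings. The sketch below uses the VirtAddr and MemSiz values of the LOAD segments shown in this mail; the function and variable names are invented for the example.

```python
# Illustrative check: does a LOAD segment overlap the ILM/DLM window?
# Addresses and sizes are copied from the readelf listings in this mail;
# the function and variable names are made up for this sketch.

ILM_DLM_START = 0x30000   # H'0_0003_0000
ILM_DLM_END = 0x4ffff     # H'0_0004_FFFF (inclusive)


def overlaps_ilm_dlm(vaddr, memsiz):
    """Return True if [vaddr, vaddr + memsiz) intersects the window."""
    if memsiz == 0:
        return False
    last = vaddr + memsiz - 1
    return vaddr <= ILM_DLM_END and last >= ILM_DLM_START


# (VirtAddr, MemSiz) of the LOAD segments from the two listings
default_layout = [(0x10000, 0x59b48), (0x6ab60, 0x3528)]
patched_layout = [(0x50000, 0x57dc8), (0xa95b8, 0x64b0)]

for name, layout in (("0x10000", default_layout), ("0x50000", patched_layout)):
    hits = [hex(v) for v, size in layout if overlaps_ilm_dlm(v, size)]
    print(f"TEXT_START_ADDR {name}: overlapping segments: {hits or 'none'}")
```

With the default start address, the first LOAD segment spans 0x10000 to 0x69b47 and therefore covers the whole window. With 0x50000, no LOAD segment touches it.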

I have attached a patch for binutils to the email. We plan to upstream this patch to binutils soon.

Cheers,
Prabhakar


Jan Kiszka
 

On 29.11.22 19:57, Prabhakar Mahadev Lad wrote:
Hi All,

-----Original Message-----
From: Pavel Machek <pavel@...>
Sent: 13 October 2022 22:48
To: Ulrich Hecht <uli@...>
Cc: cip-dev@...; Prabhakar Mahadev Lad <prabhakar.mahadev-lad.rj@...>;
Jan Kiszka <jan.kiszka@...>; Florian Bezdeka <florian.bezdeka@...>; Chris Paterson
<Chris.Paterson2@...>; Hung Tran <hung.tran.jy@...>; Pavel Machek <pavel@...>
Subject: Re: [cip-dev] ldconfig segfault on RZ/Five was Re: Preparing isar-cip-core for RZ/Five

Hi!

On 10/12/2022 11:50 AM CEST Lad Prabhakar <prabhakar.mahadev-lad.rj@...> wrote:
I did a quick test with the patches pointed by Florian but unfortunately ldconfig still fails.
I did some experiments on RZ/Five with this issue, and I'm almost positive that there is something
wrong (or doesn't work as documented) with the icache handling on this SoC.

1. The issue only affects non-PIE executables (there are very few of those, basically just ldconfig,
gcc, cpp and gcov* on the Debian system), and it occurs very early during the execution of the
program. According to the datasheet, the cache on the ax45mp-1c core is virtually indexed, so it is
unlikely that a PIE executable will ever hit anything in the cache when newly loaded, but it is much
more likely with non-PIE executables.
This is very good observation. Thanks!

And indeed it looks like _any_ non-PIE executable fails. See:
...
I have attached a patch for binutils to the email. We plan to upstream this patch to binutils soon.
Good that the issue is understood and likely solved now. Make sure to
upstream this as quickly as possible. It targets a fundamental tool and
requires recompilation of many components. And Debian will freeze the
toolchain in early January - although:

"It is unlikely that the release arch of bookworm will include riscv64."
[1] :(

Jan

[1] https://lists.debian.org/debian-riscv/2022/12/msg00009.html

--
Siemens AG, Technology
Competence Center Embedded Linux


Pavel Machek
 

Hi!

This is very good observation. Thanks!

And indeed it looks like _any_ non-PIE executable fails. See:
...

Andes cores have local memories ILM and DLM that are mapped in the
region H'0_0003_0000 - H'0_0004_FFFF on the RZ/Five SoC. When the
virtual address falls in this range the MMU doesn't trigger a page
fault and assumes the virtual address is a physical address, and hence
the application fails to run (panics somewhere).
...

Good that the issue is understood and likely solved now. Make sure to
upstream this as quickly as possible. It targets a fundamental tool and
requires recompilation of many components. And Debian will freeze the
toolchain in early January - although:

"It is unlikely that the release arch of bookworm will include riscv64."
[1] :(
I'm pretty sure this is not a complete fix. Yes, we should change the
toolchain, but the problem is really in the hardware: you can't just
take part of the _virtual_ address space and reserve it. Not if you want
to claim the board is riscv64 compatible. Someone else (manual mmap, some
kind of JIT, some kind of emulator) might want normal RAM there.

I believe this is quite important and should be solved in hardware (at
least in the next generation).

Can ILM/DLM be disabled?

If we cannot fix it at the hardware level, we'll really need to prevent
attempts to map anything in that virtual memory range. A clear -EPERM
from mmap is better than strange behaviour at runtime, and it is a
must-have from a security perspective.
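
To make the idea concrete, here is a small model of such a check. It is only a sketch of the policy, not kernel code: the function name, the assumed 4 KiB page size and the rounding behaviour are assumptions made for illustration.

```python
import errno

# Hypothetical model of the suggested range check; this is not kernel
# code, and the names below are invented for illustration.
RESERVED_START = 0x30000      # H'0_0003_0000
RESERVED_END = 0x50000        # one past H'0_0004_FFFF
PAGE_SIZE = 0x1000            # assumed 4 KiB pages


def check_fixed_mapping(addr, length):
    """Return 0 if the page-aligned range is acceptable, else -EPERM."""
    start = addr & ~(PAGE_SIZE - 1)                           # round down
    end = (addr + length + PAGE_SIZE - 1) & ~(PAGE_SIZE - 1)  # round up
    if start < RESERVED_END and end > RESERVED_START:
        return -errno.EPERM
    return 0


# A request covering 0x10000..0x69b47 would be refused, while one
# starting at 0x50000 would be allowed.
print(check_fixed_mapping(0x10000, 0x59b48))
print(check_fixed_mapping(0x50000, 0x57dc8))
```

The range is rounded out to page boundaries so that a request touching only part of a page inside the window is refused too.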

Best regards,
Pavel
--
DENX Software Engineering GmbH, Managing Director: Wolfgang Denk
HRB 165235 Munich, Office: Kirchenstr.5, D-82194 Groebenzell, Germany