Hi! Hmm, seems the issue persists: :-(. Do you get gcc faulting, too? I tried, but installation fails - illegal instruction.
Yeah, ldconfig is needed for installation. But I get a segfaulting gcc binary.

root@demo:~# ldconfig
[ 297.146728] ldconfig[497]: unhandled signal 4 code 0x1 at 0x00000000000380c8 in ldconfig[10000+83000]
...
(gdb) disassemble $pc,+0x10
Dump of assembler code from 0x380c8 to 0x380d8:
=> 0x00000000000380c8: auipc a2,0x66
   0x00000000000380cc: addi a2,a2,2000 # 0x9e898
   0x00000000000380d0: sd a0,0(a2)

auipc is something rather simple: a2 = pc + (0x66 << 12). Not sure how it could fault. Plus we get "illegal instruction", suggesting it is not some other fault.
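For reference, the address in gdb's comment can be reproduced by hand. The following is a minimal sketch, assuming the standard RV64 semantics (auipc adds its 20-bit immediate, shifted left by 12, to the pc; addi then adds a signed 12-bit immediate). The helper names are made up for illustration.

```python
MASK64 = (1 << 64) - 1


def auipc(pc: int, imm20: int) -> int:
    """Return pc + (imm20 << 12), treating imm20 as a signed 20-bit value."""
    if imm20 & 0x80000:
        imm20 -= 1 << 20
    return (pc + (imm20 << 12)) & MASK64


def addi(rs1: int, imm12: int) -> int:
    """Return rs1 + imm12, where imm12 is already a signed Python int."""
    return (rs1 + imm12) & MASK64


# Values taken from the disassembly at 0x380c8 / 0x380cc.
pc = 0x380C8
a2 = addi(auipc(pc, 0x66), 2000)
print(hex(a2))
```

The result matches the `# 0x9e898` annotation in the dump, so the instruction pair itself looks unremarkable.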
Could some kind of self-modifying code be involved? I guess some kind of debugging/watchpoint is not probable. No idea - but why should ldconfig be self-modifying?
No idea. But I do have slightly different results than you (I think; I'm far from a RISC-V expert). I set a breakpoint:

Breakpoint 1, 0x00000000000385d4 in ?? ()
(gdb)
Dump of assembler code from 0x385d4 to 0x385f4:
=> 0x00000000000385d4: lb zero,81(t1)
   0x00000000000385d8: andi a1,a1,25
   0x00000000000385da: sd zero,24(sp)
   0x00000000000385dc: sd zero,32(sp)

If I do the stepi, it will give the illegal instruction, because, well, we are in the middle of the auipc instruction:

(gdb) disassemble $pc-0x10,+0x20
Dump of assembler code from 0x385c4 to 0x385e4:
   0x00000000000385c4: .4byte 0x4881f753
   0x00000000000385c8: li a6,0
   0x00000000000385ca: li a5,0
   0x00000000000385cc: addi a3,a1,920
   0x00000000000385d0: mv a2,s8
   0x00000000000385d2: auipc a0,0x3f
   0x00000000000385d6: addi a0,a0,-1890 # 0x76e70
   0x00000000000385da: sd zero,24(sp)
   0x00000000000385dc: sd zero,32(sp)
   0x00000000000385de: sb t3,20(sp)
   0x00000000000385e2: sd s7,40(sp)
End of assembler dump.
(gdb)

Weird. But it explains sigill when executing auipc does not result in segfault...

Best regards,
Pavel
--
DENX Software Engineering GmbH, Managing Director: Wolfgang Denk
HRB 165235 Munich, Office: Kirchenstr.5, D-82194 Groebenzell, Germany
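The two dumps above are consistent with each other: decoding the same bytes two bytes into the auipc yields exactly the `lb zero,81(t1)` that gdb shows at 0x385d4. A minimal sketch, assuming the auipc and addi use the standard 32-bit encodings (the 4-byte spacing of 0x385d2, 0x385d6, 0x385da suggests so); the encoder and decoder helpers are illustrative only.

```python
def encode_auipc(rd: int, imm20: int) -> int:
    """U-type encoding: imm[31:12] | rd | opcode 0x17."""
    return ((imm20 & 0xFFFFF) << 12) | (rd << 7) | 0x17


def encode_addi(rd: int, rs1: int, imm12: int) -> int:
    """I-type encoding with funct3 = 0 and opcode 0x13."""
    return ((imm12 & 0xFFF) << 20) | (rs1 << 15) | (rd << 7) | 0x13


def decode_itype(word: int) -> dict:
    """Split a 32-bit word into I-type fields."""
    imm = word >> 20
    if imm & 0x800:
        imm -= 0x1000
    return {
        "opcode": word & 0x7F,
        "rd": (word >> 7) & 0x1F,
        "funct3": (word >> 12) & 0x7,
        "rs1": (word >> 15) & 0x1F,
        "imm": imm,
    }


A0, T1 = 10, 6  # ABI register numbers x10 and x6

# Bytes as they would sit in memory from 0x385d2 onwards (little endian).
code = (encode_auipc(A0, 0x3F).to_bytes(4, "little")
        + encode_addi(A0, A0, -1890).to_bytes(4, "little"))

# Fetching a 32-bit word at 0x385d4 means starting 2 bytes into the auipc.
misaligned = int.from_bytes(code[2:6], "little")
fields = decode_itype(misaligned)
print(hex(misaligned), fields)
```

Opcode 0x03 with funct3 0 is `lb`, rd 0 is `zero`, rs1 6 is `t1`, and the immediate is 81, so the odd-looking instruction is just the tail of the auipc followed by the head of the addi.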
|
|
Hi! I tried, but installation fails - illegal instruction. Yeah, ldconfig is needed for installation. But I get a segfaulting gcc binary.
It crashes rather soon after startup, so I was able to trace the complete path.

But I do have slightly different results than you (I think; I'm far from a RISC-V expert). I set a breakpoint:

Breakpoint 1, 0x00000000000385d4 in ?? ()

I believe it should not end up at 0x00000000000385d4 at all. The jal instruction at 0x000000000001537e should end up calling 0x3806a AFAICT, but it calls 0x385d4 instead. It happens during single-stepping, so it should not be anything subtle.

(gdb) disassemble $pc,+0x20
Dump of assembler code from 0x1537c to 0x1539c:
=> 0x000000000001537c: mv a0,a4
   0x000000000001537e: jal ra,0x3806a
   0x0000000000015382: auipc a5,0x8a
   0x0000000000015386: addi a5,a5,1342 # 0x9f8c0
   0x000000000001538a: ld a4,0(a5)
   0x000000000001538c: beqz a4,0x153f0
   0x000000000001538e: jal ra,0x38abe
   0x0000000000015392: ld a0,0(s6)
   0x0000000000015396: auipc s7,0x85
   0x000000000001539a: ld s7,-406(s7) # 0x9a200
End of assembler dump.
(gdb)
(gdb) stepi
0x000000000001537e in ?? ()
(gdb)
Program received signal SIGILL, Illegal instruction.
0x00000000000385d4 in ?? ()
(gdb)

Best regards,
Pavel
--
DENX Software Engineering GmbH, Managing Director: Wolfgang Denk
HRB 165235 Munich, Office: Kirchenstr.5, D-82194 Groebenzell, Germany
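One way to cross-check the disassembler here is to decode the jal target by hand from the raw instruction word (for example as read with gdb's `x` command at 0x1537e). The sketch below assumes the standard 32-bit J-type encoding; since the raw word is not shown in this thread, the example builds it with a hypothetical `encode_jal` helper instead of using real memory contents.

```python
def encode_jal(rd: int, offset: int) -> int:
    """J-type encoding: imm[20|10:1|11|19:12] | rd | opcode 0x6f."""
    imm = offset & 0x1FFFFF
    return ((((imm >> 20) & 0x1) << 31)
            | (((imm >> 1) & 0x3FF) << 21)
            | (((imm >> 11) & 0x1) << 20)
            | (((imm >> 12) & 0xFF) << 12)
            | (rd << 7)
            | 0x6F)


def jal_target(pc: int, word: int) -> int:
    """Reassemble the scattered immediate and add it to the pc."""
    imm = ((((word >> 31) & 0x1) << 20)
           | (((word >> 21) & 0x3FF) << 1)
           | (((word >> 20) & 0x1) << 11)
           | (((word >> 12) & 0xFF) << 12))
    if imm & (1 << 20):          # sign-extend the 21-bit offset
        imm -= 1 << 21
    return pc + imm


RA = 1                           # x1
pc = 0x1537E
word = encode_jal(RA, 0x3806A - pc)
print(hex(word), hex(jal_target(pc, word)))

# Distance between where the jump should land and where it did land.
print(hex(0x385D4 - 0x3806A))
```

If the word actually stored at 0x1537e decodes to 0x3806a as well, the binary is fine and the wrong landing address (0x56a bytes further on) has to come from somewhere else.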
|
|
On 07.10.22 00:32, Pavel Machek wrote: Hi!
Did you try to compare the call trace to QEMU, to see where we diverge?

Jan

--
Siemens AG, Technology Competence Center Embedded Linux
|
|
Hi!
Yes, that's a possible way forward, but it will require some considerable setup on my side.

If you have QEMU ready... objdump tells you ldconfig's entry point; from there you can just stepi. In less than 200 steps, you should get the SIGILL... and the complete sequence of steps that leads to it.

Best regards,
Pavel
--
DENX Software Engineering GmbH, Managing Director: Wolfgang Denk
HRB 165235 Munich, Office: Kirchenstr.5, D-82194 Groebenzell, Germany
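Once a stepi log exists for both the board and QEMU, finding the point of divergence is mechanical. A minimal sketch, assuming each log contains gdb's usual `0x... in ?? ()` lines; the QEMU log below is invented for the example and is not real output.

```python
import re

# Matches gdb's "0x000000000001537e in ?? ()" style stop lines.
PC_RE = re.compile(r"^(0x[0-9a-fA-F]+) in ")


def pcs_from_log(text: str) -> list:
    """Extract the sequence of program counters from a gdb stepi log."""
    pcs = []
    for line in text.splitlines():
        match = PC_RE.match(line.strip())
        if match:
            pcs.append(int(match.group(1), 16))
    return pcs


def first_divergence(a: list, b: list):
    """Return the index of the first differing step, or None if identical."""
    for i, (x, y) in enumerate(zip(a, b)):
        if x != y:
            return i
    if len(a) != len(b):
        return min(len(a), len(b))
    return None


board_log = """\
0x000000000001537c in ?? ()
0x000000000001537e in ?? ()
Program received signal SIGILL, Illegal instruction.
0x00000000000385d4 in ?? ()
"""

# Hypothetical QEMU log: what one would expect if the jal lands correctly.
qemu_log = """\
0x000000000001537c in ?? ()
0x000000000001537e in ?? ()
0x000000000003806a in ?? ()
"""

board = pcs_from_log(board_log)
qemu = pcs_from_log(qemu_log)
step = first_divergence(board, qemu)
print(step, hex(board[step]), hex(qemu[step]))
```

The step index then says which instruction to look at on both sides, namely the one executed just before the traces part ways.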
|
|
On 07.10.22 12:19, Pavel Machek wrote: Hi!
I've updated sid-ports (dropped the snapshot pinning), and now I'm getting a page fault on the instruction before the one that was causing SIGILL before:

[ 558.490689] CPU: 0 PID: 3212 Comm: ldconfig Not tainted 5.10.83-cip1-riscv-renesas #1
[ 558.490697] epc: 00000000000380c6 ra : 0000000000015382 sp : 0000003fff9e3c10
[ 558.490703] gp : 0000000000099da8 tp : 0000003fe9c3c800 t0 : 0000003fe9c427c0
[ 558.490710] t1 : 0000003fe9cd059c t2 : 0000002acb8f2c00 s0 : 0000002b079e9510
[ 558.490716] s1 : 0000000000000001 a0 : 0000003fff9e3d18 a1 : 0000000000000001
[ 558.490722] a2 : 0000003fff9e3c88 a3 : 0000000000000000 a4 : 0000003fff9e3d18
[ 558.490728] a5 : 000000000009736e a6 : 0000003fff9e3c80 a7 : 00000000000000dd
[ 558.490734] s2 : 0000003fff9e3c88 s3 : 0000000000000000 s4 : 0000000000000000
[ 558.490740] s5 : 00000000000105a4 s6 : 000000000009e670 s7 : 0000002b079c8ab0
[ 558.490746] s8 : 0000002b079e91c0 s9 : 0000000000000000 s10: 0000002acb8fc9b0
[ 558.490752] s11: 0000002acb8fc920 t3 : 0000002acb80f5d8 t4 : 000000000009259c
[ 558.490758] t5 : 0000000000000004 t6 : 0000002b0799c010
[ 558.490764] status: 0000000200004020 badaddr: 00000000000000e1 cause: 000000000000000f

(gdb) disassemble 0x00000000000380c6,+0x10
Dump of assembler code from 0x380c6 to 0x380d6:
   0x00000000000380c6: addi sp,sp,-416
   0x00000000000380c8: auipc a2,0x66
   0x00000000000380cc: addi a2,a2,2000 # 0x9e898
   0x00000000000380d0: sd a0,0(a2)
   0x00000000000380d2: mv a5,sp
   0x00000000000380d4: addi a4,sp,416
End of assembler dump.

I've stepped this through under qemu as well, and the control flow is identical. Registers are almost the same, except for some temporaries:

--- regs-qemu
+++ regs-rzfive
@@ -2,9 +2,9 @@
 ra 0x15382 0x15382
 sp 0x3ffffffbe0 0x3ffffffbe0
 gp 0x99da8 0x99da8
-tp 0x3ff7e77800 0x3ff7e77800
-t0 0x3ff7e7d7c0 274742106048
-t1 0x3ff7f0b59c 274742687132
+tp 0x3ff7e78800 0x3ff7e78800
+t0 0x3ff7e7e7c0 274742110144
+t1 0x3ff7f0c59c 274742691228
 t2 0x2aaab92c00 183252888576
 fp 0x2aaabaee00 0x2aaabaee00
 s1 0x1 1

No idea if that is normal (different machines, different memory sizes and layouts) or a symptom of the problem.

Jan

--
Siemens AG, Technology Competence Center Embedded Linux
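A quick subtraction shows how far apart the differing registers are. This is only arithmetic on the values quoted in the diff above; what the difference means is an interpretation, not something established in this thread.

```python
# Register values copied from the regs-qemu / regs-rzfive diff.
qemu = {"tp": 0x3FF7E77800, "t0": 0x3FF7E7D7C0, "t1": 0x3FF7F0B59C}
rzfive = {"tp": 0x3FF7E78800, "t0": 0x3FF7E7E7C0, "t1": 0x3FF7F0C59C}

deltas = {reg: rzfive[reg] - qemu[reg] for reg in qemu}
for reg, delta in deltas.items():
    print(f"{reg}: {delta:#x}")
```

All three registers differ by exactly 0x1000, i.e. one 4 KiB page. That would be consistent with a mapping simply being placed one page higher on the board, which by itself may well be harmless.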
|
|
On 08.10.22 10:27, Jan Kiszka wrote: On 07.10.22 12:19, Pavel Machek wrote:
...

OpenEmbedded nodistro.0 smarc-rzfive ttySC0

[ 12.829622] audit: type=1006 audit(1653987107.735:2): pid=156 uid=0 old-auid=4294967295 auid=0 tty=(none) old-ses=4294967295 ses=1 res=1
root@smarc-rzfive:~# ldconfig
[ 22.278868] ldconfig[166]: unhandled signal 11 code 0x1 at 0x0000000000000088 in ldconfig[10000+68000]
[ 22.290244] CPU: 0 PID: 166 Comm: ldconfig Not tainted 5.10.83-cip1-riscv-renesas #1
[ 22.298954] epc: 0000000000030eea ra : 00000000000145a0 sp : 0000003fff9f8aa0
[ 22.306906] gp : 000000000007fe48 tp : 0000003fd958b720 t0 : 0000000000000000
[ 22.314973] t1 : 0000002adf9c3bbc t2 : 00000000000003ff s0 : 0000003fff9f8c90
[ 22.322986] s1 : 0000000000014b0e a0 : 0000003fff9f8c98 a1 : 0000000000000000
[ 22.330967] a2 : 0000003fff9f8be8 a3 : 0000000000014a86 a4 : 000000000007e576
[ 22.338936] a5 : 0000000000000000 a6 : 0000003fff9f8be0 a7 : 0000000000000000
[ 22.346897] s2 : 0000000000000000 s3 : 0000003fd96df918 s4 : ffffffffffffffff
[ 22.354905] s5 : 0000002b01953f70 s6 : 0000002b01953c60 s7 : 0000002b019539b0
[ 22.362875] s8 : 0000002b01953b50 s9 : 0000000000000000 s10: 0000002adfa74584
[ 22.370884] s11: 0000000000000000 t3 : 0000003fd960ee18 t4 : 000000000000000f
[ 22.378945] t5 : 000000000000000f t6 : 0000000000000000
[ 22.385051] status: 8000000200004020 badaddr: 0000000000000088 cause: 000000000000000d
[ 22.393860] audit: type=1701 audit(1653987117.299:3): auid=4294967295 uid=0 gid=0 ses=4294967295 pid=166 comm="ldconfig" exe="/sbin/ldconfig" sig=11 res=1
Segmentation fault

That was the version I found on eMMC.

I think you have some real homework now...

Jan

--
Siemens AG, Technology Competence Center Embedded Linux
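To read the `cause:` field in these register dumps, it helps to map it to the exception names from the RISC-V privileged specification. A small sketch, assuming the kernel prints the raw scause value there; the table lists the standard synchronous exception codes, and `describe_cause` is a made-up helper name.

```python
# Standard RISC-V synchronous exception codes (scause with interrupt bit clear).
SCAUSE_EXCEPTIONS = {
    0: "Instruction address misaligned",
    1: "Instruction access fault",
    2: "Illegal instruction",
    3: "Breakpoint",
    4: "Load address misaligned",
    5: "Load access fault",
    6: "Store/AMO address misaligned",
    7: "Store/AMO access fault",
    8: "Environment call from U-mode",
    9: "Environment call from S-mode",
    12: "Instruction page fault",
    13: "Load page fault",
    15: "Store/AMO page fault",
}


def describe_cause(cause: int, xlen: int = 64) -> str:
    """Translate a raw scause value into a human-readable description."""
    is_interrupt = bool(cause >> (xlen - 1))
    code = cause & ((1 << (xlen - 1)) - 1)
    if is_interrupt:
        return f"interrupt {code}"
    return SCAUSE_EXCEPTIONS.get(code, f"reserved/unknown exception {code}")


# Values from the two kernel dumps in this thread.
print(describe_cause(0x000000000000000F))   # Debian sid-ports ldconfig
print(describe_cause(0x000000000000000D))   # OpenEmbedded ldconfig from eMMC
```

So the Debian binary dies on a store page fault with badaddr 0xe1, and the OpenEmbedded one on a load page fault with badaddr 0x88: two different binaries, both touching an address close to zero.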
|
|
Subject: Re: [cip-dev] ldconfig segfault on RZ/Five was Re: Preparing isar-cip-core for RZ/Five
On 08.10.22 10:27, Jan Kiszka wrote:
On 07.10.22 12:19, Pavel Machek wrote:
I think you have some real homework now...

What is your conclusion? Is it a toolchain-related issue? Or a cache-related issue? Or something else?

Cheers,
Biju
|
|
Hi Jan, From: Jan Kiszka <jan.kiszka@...> Sent: 09 October 2022 09:29
On 08.10.22 10:27, Jan Kiszka wrote:
On 07.10.22 12:19, Pavel Machek wrote:
I think you have some real homework now...

Thanks, we'll take a look.

Kind regards,
Chris
|
|
On 09.10.22 10:42, Biju Das wrote: Subject: Re: [cip-dev] ldconfig segfault on RZ/Five was Re: Preparing isar-cip-core for RZ/Five
On 08.10.22 10:27, Jan Kiszka wrote:
On 07.10.22 12:19, Pavel Machek wrote:
Hi!
I tried, but installation fails - illegal instruction. Yeah, ldconfig is needed for installation. But I get a
segfaulting
gcc binary. It crashes rather soon after startup, so I was able to trace complete path.
But I do have slightly different results then you (I think; I'm far from risc-v expert). I did a breakpoint:
Breakpoint 1, 0x00000000000385d4 in ?? () I believe it should not end at 0x00000000000385d4 at all. The 0x000000000001537e jal instruction should end up calling 0x3806a AFAICT, but it calls 0x385d4 instead. It happens during single-stepping, so it should not be anything subtle.
(gdb) disassemble $pc,+0x20 Dump of assembler code from 0x1537c to 0x1539c: => 0x000000000001537c: mv a0,a4 0x000000000001537e: jal ra,0x3806a 0x0000000000015382: auipc a5,0x8a 0x0000000000015386: addi a5,a5,1342 # 0x9f8c0 0x000000000001538a: ld a4,0(a5) 0x000000000001538c: beqz a4,0x153f0 0x000000000001538e: jal ra,0x38abe 0x0000000000015392: ld a0,0(s6) 0x0000000000015396: auipc s7,0x85 0x000000000001539a: ld s7,-406(s7) # 0x9a200 End of assembler dump. (gdb) (gdb) stepi 0x000000000001537e in ?? () (gdb)
Program received signal SIGILL, Illegal instruction. 0x00000000000385d4 in ?? () (gdb) Did you try to compare the call trace to QEMU, where we divert? Yes, that's possible way forward, but it will require some considerable setup on my side.
If you have QEMU ready... objdump tells you ldconfig's entrypoint, from that point you can just stepi. In less than 200 steps, you should have sigill... and complete steps that lead to it.
I've updated sid-ports (dropped the snapshot pinning), and now I'm getting a page fault on the instruction before the one that was causing SIGILL before:
[ 558.490689] CPU: 0 PID: 3212 Comm: ldconfig Not tainted 5.10.83-cip1-riscv-renesas #1 [ 558.490697] epc: 00000000000380c6 ra
: 0000000000015382 sp : 0000003fff9e3c10 [ 558.490703] gp : 0000000000099da8 tp : 0000003fe9c3c800 t0 : 0000003fe9c427c0 [ 558.490710] t1 : 0000003fe9cd059c t2 : 0000002acb8f2c00 s0 : 0000002b079e9510 [ 558.490716] s1 : 0000000000000001 a0 : 0000003fff9e3d18 a1 : 0000000000000001 [ 558.490722] a2 : 0000003fff9e3c88 a3 : 0000000000000000 a4 : 0000003fff9e3d18 [ 558.490728] a5 : 000000000009736e a6 : 0000003fff9e3c80 a7 : 00000000000000dd [ 558.490734] s2 : 0000003fff9e3c88 s3 : 0000000000000000 s4 : 0000000000000000 [ 558.490740] s5 : 00000000000105a4 s6 : 000000000009e670 s7 : 0000002b079c8ab0 [ 558.490746] s8 : 0000002b079e91c0 s9 : 0000000000000000 s10: 0000002acb8fc9b0 [ 558.490752] s11: 0000002acb8fc920 t3 : 0000002acb80f5d8 t4 : 000000000009259c [ 558.490758] t5 : 0000000000000004 t6 : 0000002b0799c010 [ 558.490764] status: 0000000200004020 badaddr: 00000000000000e1 cause: 000000000000000f
(gdb) disassemble 0x00000000000380c6,+0x10 Dump of assembler code from
0x380c6 to 0x380d6: 0x00000000000380c6: addi sp,sp,-416 0x00000000000380c8: auipc a2,0x66 0x00000000000380cc: addi a2,a2,2000 # 0x9e898 0x00000000000380d0: sd a0,0(a2) 0x00000000000380d2: mv a5,sp 0x00000000000380d4: addi a4,sp,416 End of assembler dump.
I've stepped this through under qemu as well, and the control flow is
identical. Registers are almost the same, except for some temporaries:
--- regs-qemu +++ regs-rzfive @@ -2,9 +2,9 @@ ra 0x15382 0x15382 sp 0x3ffffffbe0 0x3ffffffbe0 gp 0x99da8 0x99da8 -tp 0x3ff7e77800 0x3ff7e77800 -t0 0x3ff7e7d7c0 274742106048 -t1 0x3ff7f0b59c 274742687132 +tp 0x3ff7e78800 0x3ff7e78800 +t0 0x3ff7e7e7c0 274742110144 +t1 0x3ff7f0c59c 274742691228 t2 0x2aaab92c00 183252888576 fp 0x2aaabaee00 0x2aaabaee00 s1 0x1 1
No idea if that is normal (different machines, different memory sizes
and layouts) or a symptom of the problem.
... OpenEmbedded nodistro.0 smarc-rzfive ttySC0
[ 12.829622] audit: type=1006 audit(1653987107.735:2): pid=156 uid=0 old-auid=4294967295 auid=0 tty=(none) old-ses=4294967295 ses=1 res=1 root@smarc-rzfive:~# ldconfig [ 22.278868] ldconfig[166]: unhandled signal 11 code 0x1 at 0x0000000000000088 in ldconfig[10000+68000] [ 22.290244] CPU: 0 PID: 166 Comm: ldconfig Not tainted 5.10.83- cip1-riscv-renesas #1 [ 22.298954] epc: 0000000000030eea ra : 00000000000145a0 sp : 0000003fff9f8aa0 [ 22.306906] gp : 000000000007fe48 tp : 0000003fd958b720 t0 : 0000000000000000 [ 22.314973] t1 : 0000002adf9c3bbc t2 : 00000000000003ff s0 : 0000003fff9f8c90 [ 22.322986] s1 : 0000000000014b0e a0 : 0000003fff9f8c98 a1 : 0000000000000000 [ 22.330967] a2 : 0000003fff9f8be8 a3 : 0000000000014a86 a4 : 000000000007e576 [ 22.338936] a5 : 0000000000000000 a6 : 0000003fff9f8be0 a7 : 0000000000000000 [ 22.346897] s2 : 0000000000000000 s3 : 0000003fd96df918 s4 : ffffffffffffffff [ 22.354905] s5 : 0000002b01953f70 s6 : 0000002b01953c60 s7 : 0000002b019539b0 [ 22.362875] s8 : 0000002b01953b50 s9 : 0000000000000000 s10: 0000002adfa74584 [ 22.370884] s11: 0000000000000000 t3 : 0000003fd960ee18 t4 : 000000000000000f [ 22.378945] t5 : 000000000000000f t6 : 0000000000000000 [ 22.385051] status: 8000000200004020 badaddr: 0000000000000088 cause: 000000000000000d [ 22.393860] audit: type=1701 audit(1653987117.299:3): auid=4294967295 uid=0 gid=0 ses=4294967295 pid=166 comm="ldconfig" exe="/sbin/ldconfig" sig=11 res=1 Segmentation fault
That was the version I found on eMMC.
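The "cause" field in the oops above is the RISC-V scause value. As a reading aid, here is a minimal sketch that maps the values seen in this thread to their names; the table is a small subset of the exception codes in the RISC-V privileged specification, and the function name is made up for illustration.

```python
# Subset of the synchronous exception codes defined for scause in the
# RISC-V privileged specification (not a complete table).
SCAUSE_EXCEPTIONS = {
    0x2: "illegal instruction",
    0xC: "instruction page fault",
    0xD: "load page fault",
    0xF: "store/AMO page fault",
}


def decode_cause(cause_hex: str) -> str:
    """Map the hex 'cause' field of a kernel oops to a readable name."""
    code = int(cause_hex, 16)
    return SCAUSE_EXCEPTIONS.get(code, f"unknown cause {code:#x}")


# Values copied from the two oops dumps quoted in this thread.
print(decode_cause("000000000000000d"))
print(decode_cause("000000000000000f"))
```

So the oops with badaddr 0x88 is a load page fault, while the one with badaddr 0xe1 quoted elsewhere in the thread is a store/AMO page fault.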
I think you have some real homework now...
What is your conclusion? Is it a toolchain-related issue? Or a cache-related issue? Or something else?
I have no idea and still only limited knowledge about the arch and this SoC. We can just rule out by now that the issue is Debian-exclusive. Jan -- Siemens AG, Technology Competence Center Embedded Linux
|
|
Subject: Re: RE: [cip-dev] ldconfig segfault on RZ/Five was Re: Preparing isar-cip-core for RZ/Five
On 09.10.22 10:42, Biju Das wrote:
Subject: Re: [cip-dev] ldconfig segfault on RZ/Five was Re: Preparing
isar-cip-core for RZ/Five
On 08.10.22 10:27, Jan Kiszka wrote:
On 07.10.22 12:19, Pavel Machek wrote:
Hi!
I tried, but installation fails - illegal instruction. Yeah, ldconfig is needed for installation. But I get a segfaulting gcc binary. It crashes rather soon after startup, so I was able to trace the complete path.
But I do have slightly different results than you (I think; I'm far from a RISC-V expert). I did a breakpoint:
Breakpoint 1, 0x00000000000385d4 in ?? ()
I believe it should not end at 0x00000000000385d4 at all. The 0x000000000001537e jal instruction should end up calling 0x3806a AFAICT, but it calls 0x385d4 instead. It happens during single-stepping, so it should not be anything subtle.
(gdb) disassemble $pc,+0x20 Dump of assembler code from 0x1537c to 0x1539c: => 0x000000000001537c: mv a0,a4 0x000000000001537e: jal ra,0x3806a 0x0000000000015382: auipc a5,0x8a 0x0000000000015386: addi a5,a5,1342 # 0x9f8c0 0x000000000001538a: ld a4,0(a5) 0x000000000001538c: beqz a4,0x153f0 0x000000000001538e: jal ra,0x38abe 0x0000000000015392: ld a0,0(s6) 0x0000000000015396: auipc s7,0x85 0x000000000001539a: ld s7,-406(s7) # 0x9a200 End of assembler dump. (gdb) (gdb) stepi 0x000000000001537e in ?? () (gdb)
Program received signal SIGILL, Illegal instruction. 0x00000000000385d4 in ?? () (gdb) Did you try to compare the call trace to QEMU, where we divert? Yes, that's possible way forward, but it will require some considerable setup on my side.
If you have QEMU ready... objdump tells you ldconfig's entrypoint, from that point you can just stepi. In less than 200 steps, you should have sigill... and complete steps that lead to it.
I've updated sid-ports (dropped the snapshot pinning), and now I'm getting a page fault on the instruction before the one that was causing SIGILL before:
[ 558.490689] CPU: 0 PID: 3212 Comm: ldconfig Not tainted 5.10.83-cip1-riscv-renesas #1
[ 558.490697] epc: 00000000000380c6 ra : 0000000000015382 sp : 0000003fff9e3c10
[ 558.490703] gp : 0000000000099da8 tp : 0000003fe9c3c800 t0 : 0000003fe9c427c0
[ 558.490710] t1 : 0000003fe9cd059c t2 : 0000002acb8f2c00 s0 : 0000002b079e9510
[ 558.490716] s1 : 0000000000000001 a0 : 0000003fff9e3d18 a1 : 0000000000000001
[ 558.490722] a2 : 0000003fff9e3c88 a3 : 0000000000000000 a4 : 0000003fff9e3d18
[ 558.490728] a5 : 000000000009736e a6 : 0000003fff9e3c80 a7 : 00000000000000dd
[ 558.490734] s2 : 0000003fff9e3c88 s3 : 0000000000000000 s4 : 0000000000000000
[ 558.490740] s5 : 00000000000105a4 s6 : 000000000009e670 s7 : 0000002b079c8ab0
[ 558.490746] s8 : 0000002b079e91c0 s9 : 0000000000000000 s10: 0000002acb8fc9b0
[ 558.490752] s11: 0000002acb8fc920 t3 : 0000002acb80f5d8 t4 : 000000000009259c
[ 558.490758] t5 : 0000000000000004 t6 : 0000002b0799c010
[ 558.490764] status: 0000000200004020 badaddr: 00000000000000e1 cause: 000000000000000f
(gdb) disassemble 0x00000000000380c6,+0x10
Dump of assembler code from 0x380c6 to 0x380d6:
0x00000000000380c6: addi sp,sp,-416
0x00000000000380c8: auipc a2,0x66
0x00000000000380cc: addi a2,a2,2000 # 0x9e898
0x00000000000380d0: sd a0,0(a2)
0x00000000000380d2: mv a5,sp
0x00000000000380d4: addi a4,sp,416
End of assembler dump.
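For readers less familiar with RISC-V, the auipc/addi pairs in these dumps just build a PC-relative address: auipc adds its 20-bit immediate, shifted left by 12, to the PC, and addi then adds a signed 12-bit offset. A minimal sketch of that arithmetic follows (the function name is made up for illustration; the inputs are taken from the gdb dumps in this thread):

```python
MASK64 = (1 << 64) - 1


def auipc_addi_target(pc: int, upper20: int, lower12: int) -> int:
    """Address produced by 'auipc rd,upper20' at pc followed by 'addi rd,rd,lower12'.

    upper20 is the raw 20-bit auipc immediate as printed by the disassembler,
    lower12 is the signed addi immediate (e.g. 2000 or -1890).
    """
    if upper20 & 0x80000:          # sign-extend the 20-bit immediate
        upper20 -= 1 << 20
    return (pc + (upper20 << 12) + lower12) & MASK64


# auipc a2,0x66 at 0x380c8, then addi a2,a2,2000
print(hex(auipc_addi_target(0x380c8, 0x66, 2000)))
# auipc a5,0x8a at 0x15382, then addi a5,a5,1342
print(hex(auipc_addi_target(0x15382, 0x8a, 1342)))
```

Both results match the addresses gdb prints in its `#` comments (0x9e898 and 0x9f8c0), so the instruction bytes as read through the data side are self-consistent.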
I've stepped this through under qemu as well, and the control flow is
identical. Registers are almost the same, except for some temporaries:
--- regs-qemu
+++ regs-rzfive
@@ -2,9 +2,9 @@
 ra 0x15382 0x15382
 sp 0x3ffffffbe0 0x3ffffffbe0
 gp 0x99da8 0x99da8
-tp 0x3ff7e77800 0x3ff7e77800
-t0 0x3ff7e7d7c0 274742106048
-t1 0x3ff7f0b59c 274742687132
+tp 0x3ff7e78800 0x3ff7e78800
+t0 0x3ff7e7e7c0 274742110144
+t1 0x3ff7f0c59c 274742691228
 t2 0x2aaab92c00 183252888576
 fp 0x2aaabaee00 0x2aaabaee00
 s1 0x1 1
No idea if that is normal (different machines, different memory sizes
and layouts) or a symptom of the problem.
... OpenEmbedded nodistro.0 smarc-rzfive ttySC0
[ 12.829622] audit: type=1006 audit(1653987107.735:2): pid=156 uid=0 old-auid=4294967295 auid=0 tty=(none) old-ses=4294967295 ses=1 res=1
root@smarc-rzfive:~# ldconfig [ 22.278868] ldconfig[166]: unhandled signal 11 code 0x1 at 0x0000000000000088 in ldconfig[10000+68000] [ 22.290244] CPU: 0 PID: 166 Comm: ldconfig Not tainted 5.10.83- cip1-riscv-renesas #1 [ 22.298954] epc: 0000000000030eea ra : 00000000000145a0 sp : 0000003fff9f8aa0 [ 22.306906] gp : 000000000007fe48 tp : 0000003fd958b720 t0 : 0000000000000000 [ 22.314973] t1 : 0000002adf9c3bbc t2 : 00000000000003ff s0 : 0000003fff9f8c90 [ 22.322986] s1 : 0000000000014b0e a0 : 0000003fff9f8c98 a1 : 0000000000000000 [ 22.330967] a2 : 0000003fff9f8be8 a3 : 0000000000014a86 a4 : 000000000007e576 [ 22.338936] a5 : 0000000000000000 a6 : 0000003fff9f8be0 a7 : 0000000000000000 [ 22.346897] s2 : 0000000000000000 s3 : 0000003fd96df918 s4 : ffffffffffffffff [ 22.354905] s5 : 0000002b01953f70 s6 : 0000002b01953c60 s7 : 0000002b019539b0 [ 22.362875] s8 : 0000002b01953b50 s9 : 0000000000000000 s10: 0000002adfa74584 [ 22.370884] s11: 0000000000000000 t3 : 0000003fd960ee18 t4 : 000000000000000f [ 22.378945] t5 : 000000000000000f t6 : 0000000000000000 [ 22.385051] status: 8000000200004020 badaddr: 0000000000000088 cause: 000000000000000d [ 22.393860] audit: type=1701 audit(1653987117.299:3): auid=4294967295 uid=0 gid=0 ses=4294967295 pid=166 comm="ldconfig" exe="/sbin/ldconfig" sig=11 res=1 Segmentation fault
That was the version I found on eMMC.
I think you have some real homework now...
What is your conclusion? Is it a toolchain-related issue? Or a cache-related issue? Or something else?
I have no idea and still only limited knowledge about the arch and this SoC. We can just rule out by now that the issue is Debian-exclusive.
Thanks for your feedback.
Cheers, Biju
|
|
On 11.10.22 12:34, Biju Das via lists.cip-project.org wrote: Subject: Re: RE: [cip-dev] ldconfig segfault on RZ/Five was Re: Preparing isar-cip-core for RZ/Five
On 09.10.22 10:42, Biju Das wrote:
Subject: Re: [cip-dev] ldconfig segfault on RZ/Five was Re: Preparing
isar-cip-core for RZ/Five
On 08.10.22 10:27, Jan Kiszka wrote:
On 07.10.22 12:19, Pavel Machek wrote:
Hi!
I tried, but installation fails - illegal instruction. Yeah, ldconfig is needed for installation. But I get a segfaulting gcc binary. It crashes rather soon after startup, so I was able to trace the complete path.
But I do have slightly different results than you (I think; I'm far from a RISC-V expert). I did a breakpoint:
Breakpoint 1, 0x00000000000385d4 in ?? ()
I believe it should not end at 0x00000000000385d4 at all. The 0x000000000001537e jal instruction should end up calling 0x3806a AFAICT, but it calls 0x385d4 instead. It happens during single-stepping, so it should not be anything subtle.
(gdb) disassemble $pc,+0x20 Dump of assembler code from 0x1537c to 0x1539c: => 0x000000000001537c: mv a0,a4 0x000000000001537e: jal ra,0x3806a 0x0000000000015382: auipc a5,0x8a 0x0000000000015386: addi a5,a5,1342 # 0x9f8c0 0x000000000001538a: ld a4,0(a5) 0x000000000001538c: beqz a4,0x153f0 0x000000000001538e: jal ra,0x38abe 0x0000000000015392: ld a0,0(s6) 0x0000000000015396: auipc s7,0x85 0x000000000001539a: ld s7,-406(s7) # 0x9a200 End of assembler dump. (gdb) (gdb) stepi 0x000000000001537e in ?? () (gdb)
Program received signal SIGILL, Illegal instruction. 0x00000000000385d4 in ?? () (gdb) Did you try to compare the call trace to QEMU, where we divert? Yes, that's possible way forward, but it will require some considerable setup on my side.
If you have QEMU ready... objdump tells you ldconfig's entrypoint, from that point you can just stepi. In less than 200 steps, you should have sigill... and complete steps that lead to it.
I've updated sid-ports (dropped the snapshot pinning), and now I'm getting a page fault on the instruction before the one that was causing SIGILL before:
In case the requested page is a page with PROT_WRITE only (no PROT_READ) it might be related to https://lore.kernel.org/linux-riscv/20220915193702.2201018-1-abrestic@rivosinc.com/
AFAIR all stable branches have that problem currently.
[ 558.490689] CPU: 0 PID: 3212 Comm: ldconfig Not tainted 5.10.83-cip1-riscv-renesas #1
[ 558.490697] epc: 00000000000380c6 ra : 0000000000015382 sp : 0000003fff9e3c10
[ 558.490703] gp : 0000000000099da8 tp : 0000003fe9c3c800 t0 : 0000003fe9c427c0
[ 558.490710] t1 : 0000003fe9cd059c t2 : 0000002acb8f2c00 s0 : 0000002b079e9510
[ 558.490716] s1 : 0000000000000001 a0 : 0000003fff9e3d18 a1 : 0000000000000001
[ 558.490722] a2 : 0000003fff9e3c88 a3 : 0000000000000000 a4 : 0000003fff9e3d18
[ 558.490728] a5 : 000000000009736e a6 : 0000003fff9e3c80 a7 : 00000000000000dd
[ 558.490734] s2 : 0000003fff9e3c88 s3 : 0000000000000000 s4 : 0000000000000000
[ 558.490740] s5 : 00000000000105a4 s6 : 000000000009e670 s7 : 0000002b079c8ab0
[ 558.490746] s8 : 0000002b079e91c0 s9 : 0000000000000000 s10: 0000002acb8fc9b0
[ 558.490752] s11: 0000002acb8fc920 t3 : 0000002acb80f5d8 t4 : 000000000009259c
[ 558.490758] t5 : 0000000000000004 t6 : 0000002b0799c010
[ 558.490764] status: 0000000200004020 badaddr: 00000000000000e1 cause: 000000000000000f
(gdb) disassemble 0x00000000000380c6,+0x10
Dump of assembler code from 0x380c6 to 0x380d6:
0x00000000000380c6: addi sp,sp,-416
0x00000000000380c8: auipc a2,0x66
0x00000000000380cc: addi a2,a2,2000 # 0x9e898
0x00000000000380d0: sd a0,0(a2)
0x00000000000380d2: mv a5,sp
0x00000000000380d4: addi a4,sp,416
End of assembler dump.
I've stepped this through under qemu as well, and the control flow is
identical. Registers are almost the same, except for some temporaries:
--- regs-qemu
+++ regs-rzfive
@@ -2,9 +2,9 @@
 ra 0x15382 0x15382
 sp 0x3ffffffbe0 0x3ffffffbe0
 gp 0x99da8 0x99da8
-tp 0x3ff7e77800 0x3ff7e77800
-t0 0x3ff7e7d7c0 274742106048
-t1 0x3ff7f0b59c 274742687132
+tp 0x3ff7e78800 0x3ff7e78800
+t0 0x3ff7e7e7c0 274742110144
+t1 0x3ff7f0c59c 274742691228
 t2 0x2aaab92c00 183252888576
 fp 0x2aaabaee00 0x2aaabaee00
 s1 0x1 1
No idea if that is normal (different machines, different memory sizes
and layouts) or a symptom of the problem.
... OpenEmbedded nodistro.0 smarc-rzfive ttySC0
[ 12.829622] audit: type=1006 audit(1653987107.735:2): pid=156 uid=0 old-auid=4294967295 auid=0 tty=(none) old-ses=4294967295 ses=1 res=1
root@smarc-rzfive:~# ldconfig [ 22.278868] ldconfig[166]: unhandled signal 11 code 0x1 at 0x0000000000000088 in ldconfig[10000+68000] [ 22.290244] CPU: 0 PID: 166 Comm: ldconfig Not tainted 5.10.83- cip1-riscv-renesas #1 [ 22.298954] epc: 0000000000030eea ra : 00000000000145a0 sp : 0000003fff9f8aa0 [ 22.306906] gp : 000000000007fe48 tp : 0000003fd958b720 t0 : 0000000000000000 [ 22.314973] t1 : 0000002adf9c3bbc t2 : 00000000000003ff s0 : 0000003fff9f8c90 [ 22.322986] s1 : 0000000000014b0e a0 : 0000003fff9f8c98 a1 : 0000000000000000 [ 22.330967] a2 : 0000003fff9f8be8 a3 : 0000000000014a86 a4 : 000000000007e576 [ 22.338936] a5 : 0000000000000000 a6 : 0000003fff9f8be0 a7 : 0000000000000000 [ 22.346897] s2 : 0000000000000000 s3 : 0000003fd96df918 s4 : ffffffffffffffff [ 22.354905] s5 : 0000002b01953f70 s6 : 0000002b01953c60 s7 : 0000002b019539b0 [ 22.362875] s8 : 0000002b01953b50 s9 : 0000000000000000 s10: 0000002adfa74584 [ 22.370884] s11: 0000000000000000 t3 : 0000003fd960ee18 t4 : 000000000000000f [ 22.378945] t5 : 000000000000000f t6 : 0000000000000000 [ 22.385051] status: 8000000200004020 badaddr: 0000000000000088 cause: 000000000000000d [ 22.393860] audit: type=1701 audit(1653987117.299:3): auid=4294967295 uid=0 gid=0 ses=4294967295 pid=166 comm="ldconfig" exe="/sbin/ldconfig" sig=11 res=1 Segmentation fault
That was the version I found on eMMC.
I think you have some real homework now...
What is your conclusion? Is it a toolchain-related issue? Or a cache-related issue? Or something else?
I have no idea and still only limited knowledge about the arch and this SoC. We can just rule out by now that the issue is Debian-exclusive.
Thanks for your feedback.
Cheers, Biju
|
|
On 11.10.22 20:51, Florian Bezdeka wrote: On 11.10.22 12:34, Biju Das via lists.cip-project.org wrote:
Subject: Re: RE: [cip-dev] ldconfig segfault on RZ/Five was Re: Preparing isar-cip-core for RZ/Five
On 09.10.22 10:42, Biju Das wrote:
Subject: Re: [cip-dev] ldconfig segfault on RZ/Five was Re: Preparing
isar-cip-core for RZ/Five
On 08.10.22 10:27, Jan Kiszka wrote:
On 07.10.22 12:19, Pavel Machek wrote:
Hi!
I tried, but installation fails - illegal instruction. Yeah, ldconfig is needed for installation. But I get a segfaulting gcc binary. It crashes rather soon after startup, so I was able to trace the complete path.
But I do have slightly different results than you (I think; I'm far from a RISC-V expert). I did a breakpoint:
Breakpoint 1, 0x00000000000385d4 in ?? ()
I believe it should not end at 0x00000000000385d4 at all. The 0x000000000001537e jal instruction should end up calling 0x3806a AFAICT, but it calls 0x385d4 instead. It happens during single-stepping, so it should not be anything subtle.
(gdb) disassemble $pc,+0x20 Dump of assembler code from 0x1537c to 0x1539c: => 0x000000000001537c: mv a0,a4 0x000000000001537e: jal ra,0x3806a 0x0000000000015382: auipc a5,0x8a 0x0000000000015386: addi a5,a5,1342 # 0x9f8c0 0x000000000001538a: ld a4,0(a5) 0x000000000001538c: beqz a4,0x153f0 0x000000000001538e: jal ra,0x38abe 0x0000000000015392: ld a0,0(s6) 0x0000000000015396: auipc s7,0x85 0x000000000001539a: ld s7,-406(s7) # 0x9a200 End of assembler dump. (gdb) (gdb) stepi 0x000000000001537e in ?? () (gdb)
Program received signal SIGILL, Illegal instruction. 0x00000000000385d4 in ?? () (gdb) Did you try to compare the call trace to QEMU, where we divert? Yes, that's possible way forward, but it will require some considerable setup on my side.
If you have QEMU ready... objdump tells you ldconfig's entrypoint, from that point you can just stepi. In less than 200 steps, you should have sigill... and complete steps that lead to it.
I've updated sid-ports (dropped the snapshot pinning), and now I'm getting a page fault on the instruction before the one that was causing SIGILL before:
In case the requested page is a page with PROT_WRITE only (no PROT_READ) it might be related to https://lore.kernel.org/linux-riscv/20220915193702.2201018-1-abrestic@rivosinc.com/
AFAIR all stable branches have that problem currently.
Nice idea. I quickly hacked that on top of the rzfive kernel, but it didn't change the picture, unfortunately. That said, being able to test linus/master would be very valuable here. Jan -- Siemens AG, Technology Competence Center Embedded Linux
|
|
Hi Jan,
-----Original Message----- From: Jan Kiszka <jan.kiszka@...> Sent: 11 October 2022 21:15 To: Florian Bezdeka <florian.bezdeka@...>; cip-dev@...; Chris Paterson <Chris.Paterson2@...>; Prabhakar Mahadev Lad <prabhakar.mahadev-lad.rj@...>; Hung Tran <hung.tran.jy@...> Cc: Pavel Machek <pavel@...> Subject: Re: [cip-dev] ldconfig segfault on RZ/Five was Re: Preparing isar-cip-core for RZ/Five
On 11.10.22 20:51, Florian Bezdeka wrote:
On 11.10.22 12:34, Biju Das via lists.cip-project.org wrote:
Subject: Re: RE: [cip-dev] ldconfig segfault on RZ/Five was Re: Preparing isar-cip-core for RZ/Five
On 09.10.22 10:42, Biju Das wrote:
Subject: Re: [cip-dev] ldconfig segfault on RZ/Five was Re: Preparing
isar-cip-core for RZ/Five
On 08.10.22 10:27, Jan Kiszka wrote:
On 07.10.22 12:19, Pavel Machek wrote:
Hi!
I tried, but installation fails - illegal instruction. Yeah, ldconfig is needed for installation. But I get a segfaulting gcc binary. It crashes rather soon after startup, so I was able to trace the complete path.
But I do have slightly different results than you (I think; I'm far from a RISC-V expert). I did a breakpoint:
Breakpoint 1, 0x00000000000385d4 in ?? ()
I believe it should not end at 0x00000000000385d4 at all. The 0x000000000001537e jal instruction should end up calling 0x3806a AFAICT, but it calls 0x385d4 instead. It happens during single-stepping, so it should not be anything subtle.
(gdb) disassemble $pc,+0x20 Dump of assembler code from 0x1537c to 0x1539c: => 0x000000000001537c: mv a0,a4 0x000000000001537e: jal ra,0x3806a 0x0000000000015382: auipc a5,0x8a 0x0000000000015386: addi a5,a5,1342 # 0x9f8c0 0x000000000001538a: ld a4,0(a5) 0x000000000001538c: beqz a4,0x153f0 0x000000000001538e: jal ra,0x38abe 0x0000000000015392: ld a0,0(s6) 0x0000000000015396: auipc s7,0x85 0x000000000001539a: ld s7,-406(s7) # 0x9a200 End of assembler dump. (gdb) (gdb) stepi 0x000000000001537e in ?? () (gdb)
Program received signal SIGILL, Illegal instruction. 0x00000000000385d4 in ?? () (gdb) Did you try to compare the call trace to QEMU, where we divert? Yes, that's possible way forward, but it will require some considerable setup on my side.
If you have QEMU ready... objdump tells you ldconfig's entrypoint, from that point you can just stepi. In less than 200 steps, you should have sigill... and complete steps that lead to it.
I've updated sid-ports (dropped the snapshot pinning), and now I'm getting a page fault on the instruction before the one that was causing SIGILL before:
In case the requested page is a page with PROT_WRITE only (no PROT_READ) it might be related to https://lore.kernel.org/linux-riscv/20220915193702.2201018-1-abrestic@rivosinc.com/
AFAIR all stable branches have that problem currently.
Nice idea. I quickly hacked that on top of the rzfive kernel, but it didn't change the picture, unfortunately.
Thanks for the quick test.
That said, being able to test linus/master would be very valuable here.
I will test this on top of v6.0 and update the results.
Cheers, Prabhakar
|
|
Hi Jan,
-----Original Message----- From: Prabhakar Mahadev Lad Sent: 11 October 2022 21:49 To: Jan Kiszka <jan.kiszka@...>; Florian Bezdeka <florian.bezdeka@...>; cip-dev@...; Chris Paterson <Chris.Paterson2@...>; Hung Tran <hung.tran.jy@...> Cc: Pavel Machek <pavel@...> Subject: RE: [cip-dev] ldconfig segfault on RZ/Five was Re: Preparing isar-cip-core for RZ/Five
Hi Jan,
-----Original Message----- From: Jan Kiszka <jan.kiszka@...> Sent: 11 October 2022 21:15 To: Florian Bezdeka <florian.bezdeka@...>; cip-dev@...; Chris Paterson <Chris.Paterson2@...>; Prabhakar Mahadev Lad <prabhakar.mahadev-lad.rj@...>; Hung Tran <hung.tran.jy@...> Cc: Pavel Machek <pavel@...> Subject: Re: [cip-dev] ldconfig segfault on RZ/Five was Re: Preparing
isar-cip-core for RZ/Five
On 11.10.22 20:51, Florian Bezdeka wrote:
On 11.10.22 12:34, Biju Das via lists.cip-project.org wrote:
Subject: Re: RE: [cip-dev] ldconfig segfault on RZ/Five was Re: Preparing isar-cip-core for RZ/Five
On 09.10.22 10:42, Biju Das wrote:
Subject: Re: [cip-dev] ldconfig segfault on RZ/Five was Re: Preparing
isar-cip-core for RZ/Five
On 08.10.22 10:27, Jan Kiszka wrote:
On 07.10.22 12:19, Pavel Machek wrote:
Hi!
I tried, but installation fails - illegal instruction. Yeah, ldconfig is needed for installation. But I get a segfaulting gcc binary. It crashes rather soon after startup, so I was able to trace the complete path.
But I do have slightly different results than you (I think; I'm far from a RISC-V expert). I did a breakpoint:
Breakpoint 1, 0x00000000000385d4 in ?? ()
I believe it should not end at 0x00000000000385d4 at all. The 0x000000000001537e jal instruction should end up calling 0x3806a AFAICT, but it calls 0x385d4 instead. It happens during single-stepping, so it should not be anything subtle.
(gdb) disassemble $pc,+0x20 Dump of assembler code from 0x1537c to 0x1539c: => 0x000000000001537c: mv a0,a4 0x000000000001537e: jal ra,0x3806a 0x0000000000015382: auipc a5,0x8a 0x0000000000015386: addi a5,a5,1342 # 0x9f8c0 0x000000000001538a: ld a4,0(a5) 0x000000000001538c: beqz a4,0x153f0 0x000000000001538e: jal ra,0x38abe 0x0000000000015392: ld a0,0(s6) 0x0000000000015396: auipc s7,0x85 0x000000000001539a: ld s7,-406(s7) # 0x9a200 End of assembler dump. (gdb) (gdb) stepi 0x000000000001537e in ?? () (gdb)
Program received signal SIGILL, Illegal instruction.
0x00000000000385d4 in ?? ()
(gdb)
Did you try to compare the call trace to QEMU, where we divert?
Yes, that's possible way forward, but it will require some considerable setup on my side.
If you have QEMU ready... objdump tells you ldconfig's entrypoint, from that point you can just stepi. In less than 200 steps, you should have sigill... and complete steps that lead to it.
I've updated sid-ports (dropped the snapshot pinning), and now I'm getting a page fault on the instruction before the one that was causing SIGILL before:
In case the requested page is a page with PROT_WRITE only (no PROT_READ) it might be related to https://lore.kernel.org/linux-riscv/20220915193702.2201018-1-abrestic@rivosinc.com/
AFAIR all stable branches have that problem currently.
Nice idea. I quickly hacked that on top of the rzfive kernel, but it didn't change the picture, unfortunately.
Thanks for the quick test.
That said, being able to test linus/master would be very valuable here.
I will test this on top of v6.0 and update the results.
I did a quick test with the patches pointed out by Florian, but unfortunately ldconfig still fails.
Cheers, Prabhakar
|
|
On 10/12/2022 11:50 AM CEST Lad Prabhakar <prabhakar.mahadev-lad.rj@...> wrote:
I did a quick test with the patches pointed by Florian but unfortunately ldconfig still fails.
I did some experiments on RZ/Five with this issue, and I'm almost positive that there is something wrong (or doesn't work as documented) with the icache handling on this SoC.
1. The issue only affects non-PIE executables (there are very few of those, basically just ldconfig, gcc, cpp and gcov* on the Debian system), and it occurs very early during the execution of the program. According to the datasheet, the cache on the ax45mp-1c core is virtually indexed, so it is unlikely that a PIE executable will ever hit anything in the cache when newly loaded, but it is much more likely with non-PIE executables.
2. Setting a breakpoint before the illegal/segfaulting instruction doesn't work, and what is executed is clearly not what we're seeing through the dcache (the offending instructions are neither illegal, nor are they able to cause segfaults), so instruction fetches must see something different.
3. Neither manually calling __vdso_flush_icache() from gdb (which executes a "fence.i" instruction) nor patching a "fence.i" into the ldconfig binary seem to do anything. According to the ax45mp-1c datasheet "fence.i" should flush the dcache and invalidate the icache.
My educated guess is that, in spite of the claims in the core manual, the "fence.i" instruction is not implemented, or not implemented correctly. (The datasheet does acknowledge that "fence", without the ".i", is a nop.) The RISC-V ISA manual says that "fence.i" is part of the optional "Zifencei" extension, which I don't see mentioned in the core datasheet anywhere. (And at least at first glance, I couldn't find any other mechanism to invalidate the icache there either.)
CU Uli
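To find out which binaries fall into the non-PIE group described in point 1, one can look at the e_type field of the ELF header: fixed-address executables are ET_EXEC, while PIE executables (and shared objects) are ET_DYN. The following is a minimal sketch, not a replacement for readelf; the function names are made up for illustration.

```python
import struct

ET_EXEC = 2  # fixed-address (non-PIE) executable
ET_DYN = 3   # shared object or PIE executable


def elf_type(header: bytes) -> int:
    """Return e_type from the first 18 (or more) bytes of an ELF file."""
    if len(header) < 18 or header[:4] != b"\x7fELF":
        raise ValueError("not an ELF header")
    # EI_DATA (byte 5): 1 = little endian, 2 = big endian
    endian = "<" if header[5] == 1 else ">"
    (e_type,) = struct.unpack_from(endian + "H", header, 16)
    return e_type


def is_non_pie(header: bytes) -> bool:
    """True if the header describes a fixed-address (ET_EXEC) executable."""
    return elf_type(header) == ET_EXEC


def is_non_pie_file(path: str) -> bool:
    """Convenience wrapper that reads the header from a file on disk."""
    with open(path, "rb") as f:
        return is_non_pie(f.read(18))
```

On the target, something like `is_non_pie_file("/sbin/ldconfig")` would then be expected to return True if the binary is linked as described above; that has not been verified here.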
|
|
Hi!
(Can I get you to wrap emails at ~72 columns or so?)
I did a quick test with the patches pointed by Florian but unfortunately ldconfig still fails.
I did some experiments on RZ/Five with this issue, and I'm almost positive that there is something wrong (or doesn't work as documented) with the icache handling on this SoC.
1. The issue only affects non-PIE executables (there are very few of those, basically just ldconfig, gcc, cpp and gcov* on the Debian system), and it occurs very early during the execution of the program. According to the datasheet, the cache on the ax45mp-1c core is virtually indexed, so it is unlikely that a PIE executable will ever hit anything in the cache when newly loaded, but it is much more likely with non-PIE executables.
Ah, I was wondering what does gcc and ldconfig have in common...
2. Setting a breakpoint before the illegal/segfaulting instruction doesn't work, and what is executed is clearly not what we're seeing through the dcache (the offending instructions are neither illegal, nor are they able to cause segfaults), so instruction fetches must see something different.
In my testing, I was able to stepi from the start, and then I was able to put breakpoint at preceding instruction (which was a jump). It looked like we jumped into the middle of instruction, which would explain the fault.
Best regards, Pavel
-- DENX Software Engineering GmbH, Managing Director: Wolfgang Denk HRB 165235 Munich, Office: Kirchenstr.5, D-82194 Groebenzell, Germany
|
|
Hi! On 10/12/2022 11:50 AM CEST Lad Prabhakar <prabhakar.mahadev-lad.rj@...> wrote: I did a quick test with the patches pointed by Florian but unfortunately ldconfig still fails. I did some experiments on RZ/Five with this issue, and I'm almost positive that there is something wrong (or doesn't work as documented) with the icache handling on this SoC.
1. The issue only affects non-PIE executables (there are very few of those, basically just ldconfig, gcc, cpp and gcov* on the Debian system), and it occurs very early during the execution of the program. According to the datasheet, the cache on the ax45mp-1c core is virtually indexed, so it is unlikely that a PIE executable will ever hit anything in the cache when newly loaded, but it is much more likely with non-PIE executables.
This is very good observation. Thanks! And indeed it looks like _any_ non-PIE executable fails. See:
root@smarc-rzfive:/my# cat mytest.c
#include <stdio.h>

void main(void) { printf("ahoj svete\n"); }
root@smarc-rzfive:/my# clang mytest.c -fno-pie -static
mytest.c:3:1: warning: return type of 'main' is not 'int' [-Wmain-return-type]
void main(void) { printf("ahoj svete\n"); }
^
mytest.c:3:1: note: change return type to 'int'
void main(void) { printf("ahoj svete\n"); }
^~~~
int
1 warning generated.
root@smarc-rzfive:/my# ./a.out
[ 279.010424] a.out[214]: unhandled signal 11 code 0x1 at 0xffffff8c38bd1524
(-O3 -g might be useful to add to clang command line).
Then you can
b _dl_discover_osversion
run
(gdb) disassemble /r
Dump of assembler code for function _dl_discover_osversion:
0x000000000002538a <+0>: 41 71 addi sp,sp,-496
0x000000000002538c <+2>: a8 00 addi a0,sp,72
0x000000000002538e <+4>: 86 f7 sd ra,488(sp)
0x0000000000025390 <+6>: a2 f3 sd s0,480(sp)
0x0000000000025392 <+8>: a6 ef sd s1,472(sp)
0x0000000000025394 <+10>: ca eb sd s2,464(sp)
=> 0x0000000000025396 <+12>: ef 60 a1 5c jal ra,0x3b960 <uname>
0x000000000002539a <+16>: 93 05 a1 0c addi a1,sp,202
0x000000000002539e <+20>: 49 e5 bnez a0,0x25428 <_dl_discover_osversion+158>
0x00000000000253a0 <+22>: 81 48 li a7,0
0x00000000000253a2 <+24>: 01 45 li a0,0
0x00000000000253a4 <+26>: 25 48 li a6,9
0x00000000000253a6 <+28>: 13 03 e0 02 li t1,46
It clearly tries to call uname, which.. it should, according to the source code. But somehow it ends up in completely different function:
(gdb) stepi
Program received signal SIGILL, Illegal instruction.
0x000000000003b2fe in wcsrtombs ()
Best regards, Pavel
-- DENX Software Engineering GmbH, Managing Director: Wolfgang Denk HRB 165235 Munich, Office: Kirchenstr.5, D-82194 Groebenzell, Germany
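As a cross-check of the disassembly above, the raw bytes `ef 60 a1 5c` can be decoded by hand. The sketch below follows the J-type immediate layout from the RISC-V ISA manual; the function name is made up for illustration. It confirms that the bytes seen through the data side really do encode a jump to 0x3b960, so the SIGILL at 0x3b2fe cannot be explained by the instruction as read by gdb.

```python
def decode_jal(word: int, pc: int):
    """Decode a 32-bit RISC-V JAL instruction; return (rd, target address)."""
    if word & 0x7F != 0x6F:
        raise ValueError("not a JAL instruction")
    rd = (word >> 7) & 0x1F
    imm = (
        ((word >> 31) & 0x1) << 20      # imm[20]
        | ((word >> 12) & 0xFF) << 12   # imm[19:12]
        | ((word >> 20) & 0x1) << 11    # imm[11]
        | ((word >> 21) & 0x3FF) << 1   # imm[10:1]
    )
    if imm & (1 << 20):                 # sign-extend the 21-bit offset
        imm -= 1 << 21
    return rd, pc + imm


# Bytes and address copied from the 'disassemble /r' output above.
raw = bytes.fromhex("ef60a15c")
word = int.from_bytes(raw, "little")
rd, target = decode_jal(word, 0x25396)
print(f"rd=x{rd} target={target:#x}")
```

Register x1 is ra, so this agrees with gdb's `jal ra,0x3b960 <uname>`.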
|
|
Hi All,
-----Original Message----- From: Pavel Machek <pavel@...> Sent: 13 October 2022 22:48 To: Ulrich Hecht <uli@...> Cc: cip-dev@...; Prabhakar Mahadev Lad <prabhakar.mahadev-lad.rj@...>; Jan Kiszka <jan.kiszka@...>; Florian Bezdeka <florian.bezdeka@...>; Chris Paterson <Chris.Paterson2@...>; Hung Tran <hung.tran.jy@...>; Pavel Machek <pavel@...> Subject: Re: [cip-dev] ldconfig segfault on RZ/Five was Re: Preparing isar-cip-core for RZ/Five
Hi!
On 10/12/2022 11:50 AM CEST Lad Prabhakar <prabhakar.mahadev-lad.rj@...> wrote: I did a quick test with the patches pointed by Florian but unfortunately ldconfig still fails. I did some experiments on RZ/Five with this issue, and I'm almost positive that there is something wrong (or doesn't work as documented) with the icache handling on this SoC.
1. The issue only affects non-PIE executables (there are very few of those, basically just ldconfig, gcc, cpp and gcov* on the Debian system), and it occurs very early during the execution of the program. According to the datasheet, the cache on the ax45mp-1c core is virtually indexed, so it is unlikely that a PIE executable will ever hit anything in the cache when newly loaded, but it is much more likely with non-PIE executables. This is very good observation. Thanks!
And indeed it looks like _any_ non-PIE executable fails. See:
Just a brief about the issue and solution:
TEXT_START_ADDR is the start of the text segment of an application. This is set to 0x10000 for RISC-V platforms.
So when an application is compiled with the static flag, the load starts from 0x10000 - xyz (depending on the size of the application):
Entry point 0x101c0
There are 5 program headers, starting at offset 64
Program Headers:
Type Offset VirtAddr PhysAddr FileSiz MemSiz Flags Align
LOAD 0x0000000000000000 0x0000000000010000 0x0000000000010000 0x0000000000059b48 0x0000000000059b48 R E 0x1000
LOAD 0x0000000000059b60 0x000000000006ab60 0x000000000006ab60 0x0000000000001f68 0x0000000000003528 RW 0x1000
So for the above application, which is compiled statically, we can see the entry point is 0x101c0 and the load address is 0x0000000000010000.
Andes cores have local memories, ILM and DLM, that are mapped in the region H'0_0003_0000 - H'0_0004_FFFF on the RZ/Five SoC. When a virtual address falls in this range, the MMU doesn't trigger a page fault and treats the virtual address as a physical address, and hence the application fails to run (panics somewhere).
So to avoid this issue we set TEXT_START_ADDR to 0x50000 so that the virtual addresses of any statically compiled application don't fall in the range H'0_0003_0000 - H'0_0004_FFFF.
Elf file type is EXEC (Executable file)
Entry point 0x504e4
There are 5 program headers, starting at offset 64
Program Headers:
Type Offset VirtAddr PhysAddr FileSiz MemSiz Flags Align
LOAD 0x0000000000000000 0x0000000000050000 0x0000000000050000 0x0000000000057dc8 0x0000000000057dc8 R E 0x1000
LOAD 0x00000000000585b8 0x00000000000a95b8 0x00000000000a95b8 0x0000000000004ee0 0x00000000000064b0 RW 0x1000
NOTE 0x0000000000000158 0x0000000000050158 0x0000000000050158 0x0000000000000044 0x0000000000000044 R 0x4
So now, with the fix, we can see that the statically compiled application is offset: the entry point is 0x504e4 and the load address is 0x0000000000050000.
So with this we can be sure the MMU will always trigger a page fault. I have attached a patch for binutils to the email. We plan to upstream this patch to binutils soon.
Cheers, Prabhakar
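The explanation above boils down to an interval-overlap check between each LOAD segment and the ILM/DLM window. A minimal sketch, using the VirtAddr/MemSiz pairs and faulting addresses quoted in this thread (the function and variable names are made up for illustration):

```python
# ILM/DLM window on RZ/Five as given above (H'0_0003_0000 - H'0_0004_FFFF).
ILM_DLM_START = 0x30000
ILM_DLM_END = 0x4FFFF  # inclusive


def overlaps_local_memory(vaddr: int, memsz: int) -> bool:
    """True if [vaddr, vaddr + memsz) intersects the ILM/DLM window."""
    last = vaddr + memsz - 1
    return vaddr <= ILM_DLM_END and last >= ILM_DLM_START


# (VirtAddr, MemSiz) of the LOAD segments from the two readelf dumps.
before_fix = [(0x10000, 0x59B48), (0x6AB60, 0x3528)]
after_fix = [(0x50000, 0x57DC8), (0xA95B8, 0x64B0)]

print([overlaps_local_memory(v, s) for v, s in before_fix])
print([overlaps_local_memory(v, s) for v, s in after_fix])

# Faulting PCs reported earlier in the thread.
fault_pcs = [0x380C6, 0x385D4, 0x30EEA, 0x3B2FE]
print(all(ILM_DLM_START <= pc <= ILM_DLM_END for pc in fault_pcs))
```

With the default TEXT_START_ADDR the text segment covers the window, and the faulting PCs reported earlier all lie inside it; with 0x50000 neither LOAD segment touches it. This only checks the addresses and says nothing about the hardware behaviour itself.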
|
|
On 29.11.22 19:57, Prabhakar Mahadev Lad wrote: Hi All,
-----Original Message----- From: Pavel Machek <pavel@...> Sent: 13 October 2022 22:48 To: Ulrich Hecht <uli@...> Cc: cip-dev@...; Prabhakar Mahadev Lad <prabhakar.mahadev-lad.rj@...>; Jan Kiszka <jan.kiszka@...>; Florian Bezdeka <florian.bezdeka@...>; Chris Paterson <Chris.Paterson2@...>; Hung Tran <hung.tran.jy@...>; Pavel Machek <pavel@...> Subject: Re: [cip-dev] ldconfig segfault on RZ/Five was Re: Preparing isar-cip-core for RZ/Five
Hi!
On 10/12/2022 11:50 AM CEST Lad Prabhakar <prabhakar.mahadev-lad.rj@...> wrote:
I did a quick test with the patches pointed by Florian but unfortunately ldconfig still fails.

I did some experiments on RZ/Five with this issue, and I'm almost positive that there is something wrong (or doesn't work as documented) with the icache handling on this SoC.

1. The issue only affects non-PIE executables (there are very few of those, basically just ldconfig, gcc, cpp and gcov* on the Debian system), and it occurs very early during the execution of the program. According to the datasheet, the cache on the ax45mp-1c core is virtually indexed, so it is unlikely that a PIE executable will ever hit anything in the cache when newly loaded, but it is much more likely with non-PIE executables.

This is very good observation. Thanks!
And indeed it looks like _any_ non-PIE executable fails. See:
Good that the issue is understood and likely solved now. Make sure to upstream this as quickly as possible. It targets a fundamental tool and requires recompilation of many components. And Debian will freeze the toolchain in early January - although:

"It is unlikely that the release arch of bookworm will include riscv64." [1]

:(

Jan

[1] https://lists.debian.org/debian-riscv/2022/12/msg00009.html

--
Siemens AG, Technology
Competence Center Embedded Linux
|
|
Hi!
... Good that the issue is understood and likely solved now. Make sure to upstream this as quickly as possible. It targets a fundamental tool and requires recompilation of many components. And Debian will freeze the toolchain in early January - although:

"It is unlikely that the release arch of bookworm will include riscv64." [1] :(

I'm pretty sure this is not a complete fix. Yes, we should change the toolchain, but the problem is really in the hardware: you can't just take part of the _virtual_ address space and reserve it. Not if you want to claim the board is riscv64 compatible. Someone else (manual mmap, some kind of JIT, some kind of emulator) might want normal RAM there.

I believe this is quite important and should be solved in hardware (at least in the next generation). Can ILM/DLM be disabled?

If we cannot fix it at the hardware level, we'll really need to prevent attempts to map anything in that virtual memory range. A clear -EPERM from mmap is better than strange behaviour at runtime, and it is a must-have from a security perspective.

Best regards,
Pavel
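The fallback policy suggested in this message (refuse mappings in the affected window instead of letting them misbehave at runtime) could be sketched roughly as follows. This is a user-space Python illustration of the range check only, not kernel code and not an existing interface: `check_map_request` is a hypothetical name, and the window bounds are the ones quoted earlier in the thread.

```python
import errno

# ILM/DLM window as quoted earlier in the thread (end is inclusive).
LM_START = 0x30000
LM_END = 0x4FFFF


def check_map_request(addr, length):
    """Hypothetical policy check for a fixed-address mapping request.

    Returns 0 if the range [addr, addr + length) stays clear of the
    window, -errno.EPERM if it touches the window, and -errno.EINVAL
    for a non-positive length.
    """
    if length <= 0:
        return -errno.EINVAL
    last = addr + length - 1  # last byte of the requested range
    if addr <= LM_END and last >= LM_START:
        return -errno.EPERM
    return 0


if __name__ == "__main__":
    # A few sample requests: (address, length in bytes).
    for addr, length in [(0x10000, 0x1000), (0x30000, 0x1000), (0x50000, 0x1000)]:
        print(hex(addr), hex(length), check_map_request(addr, length))
```

A real implementation would have to live in the kernel's address-space handling and also cover address hints, mremap and the initial ELF load; the sketch only shows the overlap test and the explicit error.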
|
|