January 28, 2017

Edgerouter Lite-3: a painful fiasco

For the new people coming here, let's give some background first about what I do in my spare time.

For two years now, I have been working on my own Linux distribution. Of course, it will not become the new Ubuntu son. That would be pointless. My primary goal is to target embedded hardware, in order to have a solid network infrastructure at home. It was an interesting challenge, so far. For a complete insight, I encourage you to visit the dedicated page (that needs to be updated, tho).

Finding hardware that suits me and my needs, always been a painful crusade. I do not want to be stuck in a door because of some poor designs or decisions (that is very common, VERY COMMONNNN among manufacturers). That's why I try to be careful. One day, I saw a good candidate: Ubiquiti Edgerouter Lite (Lite-3). As a guy that likes to work with MIPS architecture, seeing this Cavium Octeon+ which is a multi-core MIPS64 CPU, sounded kinda exciting. 3 ethernets, 512Mib RAM, yeah, not bad… The system is installed on an (warmy) USB key that can be removed and "reflashed" with your special OS. Humm, yummy. I thought it was the ultimate board I found. My theory was confirmed after reading all those blog posts explaining how awesome it is. My right hand and the mice decided to buy two units. Stupid boy, first mistake…

I will not describe the hardware. You just need to know that it does not feel clumsy. The case is sturdy enough to trick you. I would say it is the only thing that sounds good. Don't be naive, it was built in China.

Last weekend, the first ERLite successfully booted with 'solyste'. Let's say I was excited, that's why I posted two tweets, full of joy. I should calm down. First thing I noticed after a few minutes, was a delay when typing commands or navigating in shell history. Humm strange, as the ERLite is directly connected on another router. My desktop is also plugged on the same router. Even if it was a bit boring, I didn't look closer.

Soon after, I was also pissed because OpenNTPD was unable to sync this machine on two pools (fr.pool and europe.pool). That means 8 NTP servers… I now use this daemon for many weeks, on many different devices (with and without RTC), powered by different CPU architectures and it didn't fail 95% of the time. Here the output:

$ ntpctl -s all
0/8 peers valid, clock unsynced

peer
   wt tl st  next  poll          offset       delay      jitter
37.187.18.4 from pool fr.pool.ntp.org
	1  2  -   77s  300s             ---- peer not valid ----
151.80.19.218 from pool fr.pool.ntp.org
	1  2  -   80s  300s             ---- peer not valid ----
178.33.111.48 from pool fr.pool.ntp.org
	1  2  -   75s  300s             ---- peer not valid ----
194.177.34.115 from pool fr.pool.ntp.org
	1  2  3  229s  300s             ---- peer not valid ----
136.243.110.35 from pool europe.pool.ntp.org
	1  2  -   87s  300s             ---- peer not valid ----
139.112.153.36 from pool europe.pool.ntp.org
	1  2  -   99s  300s             ---- peer not valid ----
178.17.161.12 from pool europe.pool.ntp.org
	1  2  -   72s  300s             ---- peer not valid ----
213.136.0.252 from pool europe.pool.ntp.org
	1  2  -   90s  300s             ---- peer not valid ----

The logs were enough verbose to tell me what's wrong:

$ solog openntpd -f
no reply received in time, skipping initial time setting
no reply from 213.136.0.252 received in time, next query 300s
peer 213.136.0.252 now invalid
no reply from 139.112.153.36 received in time, next query 300s
peer 139.112.153.36 now invalid
no reply from 194.177.34.115 received in time, next query 300s
peer 194.177.34.115 now invalid

What can I fuckin' do with a machine that is unable to get synced with NTP? At this point, I was really jaded. The DNS resolution is correctly done. Trying to ping the destination seemed pretty OK:

$ ping 213.136.0.252
PING 213.136.0.252 (213.136.0.252): 56 data bytes
64 bytes from 213.136.0.252: seq=0 ttl=49 time=37.627 ms
64 bytes from 213.136.0.252: seq=1 ttl=49 time=37.752 ms
64 bytes from 213.136.0.252: seq=3 ttl=49 time=36.390 ms
64 bytes from 213.136.0.252: seq=6 ttl=49 time=36.862 ms
^C
--- 213.136.0.252 ping statistics ---
7 packets transmitted, 4 packets received, 42% packet loss
round-trip min/avg/max = 36.390/37.157/37.752 ms

HUM, WHAT?? Can you notice the problem? It can be done in the blink of an eye… 42% packet loss. Needless to say it is the true output. Missing seqs {a,we}re there for real. The packets to/from are just lost. That's why I receive no replies most of the time.

I thought something: maybe my kernel is just bad and lacks a required driver. That is very unlikely but let's dig deeper. I followed the old and deprecated page from OpenWrt wiki, to install that distribution on the key. I can assume their kernel is fine and includes everything needed for the ERLite. Before I continue about the results, please note that the tests were done using the serial interface. There were no network disturbances on the machine and the router. I also reproduced these results, many times.

Here the easy procedure: let's ping the gateway. I will not be impacted by internet lags, as I am only working on the LAN.

We are now running on OpenWrt 14.07 (Barrier Breaker):

$ uname -a
Linux OpenWrt 3.10.49 #3 SMP Wed Oct 1 16:27:17 CEST 2014 mips64 GNU/Linux

Results:

$ ping 192.168.5.1
PING 192.168.5.1 (192.168.5.1): 56 data bytes
64 bytes from 192.168.5.1: seq=0 ttl=64 time=0.603 ms
64 bytes from 192.168.5.1: seq=1 ttl=64 time=0.555 ms
64 bytes from 192.168.5.1: seq=4 ttl=64 time=0.538 ms
64 bytes from 192.168.5.1: seq=5 ttl=64 time=0.503 ms
64 bytes from 192.168.5.1: seq=7 ttl=64 time=0.544 ms
[...]
64 bytes from 192.168.5.1: seq=196 ttl=64 time=0.528 ms
64 bytes from 192.168.5.1: seq=197 ttl=64 time=0.516 ms
64 bytes from 192.168.5.1: seq=198 ttl=64 time=0.490 ms
64 bytes from 192.168.5.1: seq=201 ttl=64 time=0.536 ms
^C
--- 192.168.5.1 ping statistics ---
203 packets transmitted, 148 packets received, 27% packet loss
round-trip min/avg/max = 0.470/0.523/0.635 ms

Do I have to comment it? As my previous observation was confirmed, I decided to build four kernel versions: 4.1.38, 4.4.40, 4.4.44 and 4.9.5 with the stock cavium_octeon defconfig. I only turned off modules. Using the defconfig means I stick to default settings. The kernels size is around 12 or 13 Mib. I will only show you the output for the 4.4.40, because the same behavior applies on all other kernels:

$ uname -a
Linux solyste-erlite 4.4.40 #1 SMP PREEMPT Sun Jan 22 20:29:06 CET 2017 mips64 GNU/Linux

Results:

$ ping 192.168.5.1
PING 192.168.5.1 (192.168.5.1): 56 data bytes
64 bytes from 192.168.5.1: seq=0 ttl=64 time=0.614 ms
64 bytes from 192.168.5.1: seq=1 ttl=64 time=0.544 ms
64 bytes from 192.168.5.1: seq=2 ttl=64 time=0.544 ms
64 bytes from 192.168.5.1: seq=3 ttl=64 time=0.494 ms
64 bytes from 192.168.5.1: seq=5 ttl=64 time=0.582 ms
[...]
64 bytes from 192.168.5.1: seq=207 ttl=64 time=0.552 ms
64 bytes from 192.168.5.1: seq=208 ttl=64 time=0.542 ms
64 bytes from 192.168.5.1: seq=209 ttl=64 time=0.552 ms
64 bytes from 192.168.5.1: seq=210 ttl=64 time=0.533 ms
^C
--- 192.168.5.1 ping statistics ---
211 packets transmitted, 118 packets received, 44% packet loss
round-trip min/avg/max = 0.474/0.547/0.634 ms

Gniark, gniark. That is just INSANE. Now, you would think you have seen enough… No, lemme tell you it's false. The interfaces are ALSO flapping very often (on all eth*). OMG! It makes it even worse.

Jan 23 18:06:13 solyste-erlite kern.notice kernel: [  937.178879] eth0: Link down
Jan 23 18:06:14 solyste-erlite kern.notice kernel: [  938.178984] eth0: 1000 Mbps Full duplex, port 0, queue 0
Jan 23 18:40:43 solyste-erlite kern.notice kernel: [ 3007.166940] eth0: Link down
Jan 23 18:40:44 solyste-erlite kern.notice kernel: [ 3008.167049] eth0: 1000 Mbps Full duplex, port 0, queue 0
Jan 23 18:41:13 solyste-erlite kern.notice kernel: [ 3037.170935] eth0: Link down
Jan 23 18:41:14 solyste-erlite kern.notice kernel: [ 3038.171046] eth0: 1000 Mbps Full duplex, port 0, queue 0
Jan 23 18:41:23 solyste-erlite kern.notice kernel: [ 3047.174919] eth0: Link down
Jan 23 18:41:24 solyste-erlite kern.notice kernel: [ 3048.175037] eth0: 1000 Mbps Full duplex, port 0, queue 0
Jan 23 18:41:54 solyste-erlite kern.notice kernel: [ 3078.178934] eth0: Link down
Jan 23 18:41:55 solyste-erlite kern.notice kernel: [ 3079.179053] eth0: 1000 Mbps Full duplex, port 0, queue 0

We probably hit a serious bug. I can already see, hear or read the fanboys coming here and telling me: "Nothing new, you are using bloated Linux. Everybody knows it is awful. Use my very special BSD". No, boy. Stop telling me random things you just found inside your magical hat. Let's install OpenBSD-current!

$ uname -a
OpenBSD octeontest.hell 6.0 GENERIC#0 octeon

Results:

$ ping 192.168.5.1
PING 192.168.5.1 (192.168.5.1): 56 data bytes
64 bytes from 192.168.5.1: icmp_seq=1 ttl=64 time=0.644 ms
64 bytes from 192.168.5.1: icmp_seq=2 ttl=64 time=0.602 ms
64 bytes from 192.168.5.1: icmp_seq=4 ttl=64 time=0.573 ms
64 bytes from 192.168.5.1: icmp_seq=5 ttl=64 time=0.603 ms
64 bytes from 192.168.5.1: icmp_seq=6 ttl=64 time=0.606 ms
64 bytes from 192.168.5.1: icmp_seq=8 ttl=64 time=0.607 ms
[...]
64 bytes from 192.168.5.1: icmp_seq=208 ttl=64 time=0.584 ms
64 bytes from 192.168.5.1: icmp_seq=209 ttl=64 time=0.604 ms
64 bytes from 192.168.5.1: icmp_seq=210 ttl=64 time=0.609 ms
64 bytes from 192.168.5.1: icmp_seq=211 ttl=64 time=0.609 ms
^C
--- 192.168.5.1 ping statistics ---
212 packets transmitted, 144 packets received, 32.1% packet loss
round-trip min/avg/max/std-dev = 0.549/1.940/193.017/15.979 ms

The value is still very HIGH! Another euphemism to say it is unacceptable. You could ask if it's not my LAN that is a defective. Yeah, maybe. Let's apply the same procedure on a Raspberry Pi, that acts as my primary DNS resolver:

$ uname -a
Linux solyste-rpi 4.4.38-SOLYSTE #1 PREEMPT Sat Jan 14 10:03:03 CET 2017 armv6l GNU/Linux

Results:

$ ping 192.168.5.1
PING 192.168.5.1 (192.168.5.1): 56 data bytes
64 bytes from 192.168.5.1: seq=0 ttl=64 time=1.386 ms
64 bytes from 192.168.5.1: seq=1 ttl=64 time=0.818 ms
64 bytes from 192.168.5.1: seq=2 ttl=64 time=0.842 ms
64 bytes from 192.168.5.1: seq=3 ttl=64 time=0.817 ms
64 bytes from 192.168.5.1: seq=4 ttl=64 time=0.825 ms
64 bytes from 192.168.5.1: seq=5 ttl=64 time=0.814 ms
[...]
64 bytes from 192.168.5.1: seq=278 ttl=64 time=0.815 ms
64 bytes from 192.168.5.1: seq=279 ttl=64 time=0.796 ms
64 bytes from 192.168.5.1: seq=280 ttl=64 time=0.804 ms
64 bytes from 192.168.5.1: seq=281 ttl=64 time=0.799 ms
^C
--- 192.168.5.1 ping statistics ---
282 packets transmitted, 282 packets received, 0% packet loss
round-trip min/avg/max = 0.754/0.833/1.981 ms

Is there any question? This output is bloody shiny. We faced something nasty, indeed.

My brain had a kernel panic. After a few minutes and a brain reboot (with cheese), I told myself: "Dude, this hardware is clearly broken. You did what you could". Usually, I like to understand why problems occur. You know what? Let's apply a final test on the stock OS, that was backuped. The default system is called Ubiquiti EdgeOS. It is an old, very very old Vyatta fork, where you can tune two or three settings. Please note that the userland is not MIPS64. Binaries are 32bits. I mocked Ubiquiti on Twitter about that (Cavium®© probably smiled too). That is a shame! Do you remember when I told you manufacturers do crap? Even my Casio GRAPH25 has more features…:

$ uname -a
Linux ubnt 2.6.32.13-UBNT #1 SMP Tue Jun 4 14:54:28 PDT 2013 mips64 GNU/Linux

Results:

$ ping 192.168.5.1
PING 192.168.5.1 (192.168.5.1) 56(84) bytes of data.
64 bytes from 192.168.5.1: icmp_req=1 ttl=64 time=8.21 ms
64 bytes from 192.168.5.1: icmp_req=2 ttl=64 time=0.287 ms
64 bytes from 192.168.5.1: icmp_req=3 ttl=64 time=0.276 ms
64 bytes from 192.168.5.1: icmp_req=4 ttl=64 time=0.288 ms
64 bytes from 192.168.5.1: icmp_req=5 ttl=64 time=0.293 ms
64 bytes from 192.168.5.1: icmp_req=6 ttl=64 time=0.302 ms
[...]
64 bytes from 192.168.5.1: icmp_req=216 ttl=64 time=0.295 ms
64 bytes from 192.168.5.1: icmp_req=217 ttl=64 time=0.290 ms
64 bytes from 192.168.5.1: icmp_req=218 ttl=64 time=0.291 ms
64 bytes from 192.168.5.1: icmp_req=219 ttl=64 time=0.291 ms
^C
--- 192.168.5.1 ping statistics ---
219 packets transmitted, 219 received, 0% packet loss, time 217997ms
rtt min/avg/max/mdev = 0.244/0.326/8.215/0.534 ms

It's time for me to analyze this output. First of all, pings are much more fast. A comparison is not necessary, it's obvious. You can still scroll back, if you have a doubt. The stock kernel has some blobs to increase network performances. I'm not surprised at all.

Secondly, not a single packet was lost. Therefore, the hardware is not defective. This conclusion is irrevocable. Third, I was dumb to buy two units. Seeing those results is just disappointing. I thought I could recycle them on OpenBSD, but the benchmarks were so bad too… I'm going to sell them ASAP! Instead, I will find another candidate (maybe an APU2?). I could have launched other benchmarks on the second unit, but I have had enough of this board and I want to let it untouched/intact.

I'm posting this article to contradict everything I read on blogs. The only thing staying in my mind: how people cannot notice how those performances are awful? If you are still reading my post, please share it. People have to know the miserable truth.