Network performance of the PlayStation 2 with SONY software

Discussion in 'Sony Programming and Development' started by sp193, Sep 19, 2014.

  1. sp193

    sp193 Site Soldier

    Joined:
    Mar 28, 2012
    Messages:
    2,217
    Likes Received:
    1,052
    Hi guys,

    Here's something that I've always wondered: what was the highest transfer rate that anyone has ever observed SONY's own PlayStation 2 software achieving while transferring data over the network?

    Since 2012, I've been doing thorough research on replacing the SMAP driver and the LWIP protocol stack. I've added DMA support to the homebrew SMAP driver, but I've always been disappointed with the meagre throughput that the PlayStation 2 has been giving me:
    1. TCP/IP stack on IOP: ~2.3MB/s
    2. UDP stack on IOP (with the interrupt stealing hack): ~5.1MB/s
    3. UDP stack on IOP: ~4.4MB/s
    4. TCP/IP stack on EE: ~2.8MB/s
    5. UDP stack on EE: ~3.4MB/s

    The tests were mainly done with HDLDump, while I was developing HDLDump servers v0.9.2, v0.9.3 and v0.8.7. Similar TCP/IP speeds were obtained with HDLGameInstaller too.

    On the same network, my laptop (with a 100Mbit Realtek adaptor) is able to transfer data over SMB at about 10MB/s. This shows that such speeds are definitely achievable on my network, but seemingly not within reach of the PS2. :(
    My desktop PCs have always had Gigabit adaptors and were able to get over 10MB/s through the 100Mbit Internet connection here, but I am not going to compare them directly with the PS2 because they use a different Ethernet standard.

    If even SONY's software was never able to obtain faster speeds than my tests with the TCP/IP stacks, then it is probably safe to assume that it's just a limitation of the hardware (the IOP being too weak, and the DEV9 DMA channel and the SPEED chip not being suitable for proper Fast Ethernet support).
     
    Last edited: Sep 19, 2014
  2. smf

    smf mamedev

    Joined:
    Apr 14, 2005
    Messages:
    1,255
    Likes Received:
    88
    I'm surprised that TCP/IP is so much worse than UDP.
     
  3. sp193

    sp193 Site Soldier

    Joined:
    Mar 28, 2012
    Messages:
    2,217
    Likes Received:
    1,052
    It seems to be the IOP's fault. TCP/IP has additional overhead, which in turn results in more packets for the IOP to send and receive.

    Maybe it's possible to lessen the impact of the IOP's inadequacies by using SIF CMD instead of SIF RPC, but that would involve a lot of work to rewrite NETMAN. Therefore, I want to know if SONY could achieve better performance, before I attempt such a rewrite.
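
    Something like this is what I have in mind for the EE side (just a sketch, assuming the homebrew SDK's SIFCMD API; the command ID, packet layout and handler body are made up for illustration):
    Code:
    #include <kernel.h>
    #include <sifcmd.h>

    /* Made-up command ID and packet layout, for illustration only. */
    #define NETMAN_CMD_NEW_FRAMES 0x00000C01

    typedef struct {
        SifCmdHeader_t header;   /* mandatory SIFCMD header */
        u32 frame_count;         /* how many new frames the IOP has queued */
    } netman_cmd_pkt_t;

    /* Runs from the EE's SIF interrupt whenever the IOP sends the command,
     * instead of the EE having to service a full SIFRPC request. */
    static void new_frames_handler(void *packet, void *arg)
    {
        netman_cmd_pkt_t *pkt = (netman_cmd_pkt_t *)packet;
        (void)arg;
        /* ...record pkt->frame_count and wake the EE-side stack thread... */
        (void)pkt;
    }

    void netman_cmd_init(void)
    {
        /* Assumes SifInitCmd() has already been called during startup. */
        SifAddCmdHandler(NETMAN_CMD_NEW_FRAMES, &new_frames_handler, NULL);
    }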
     
  4. smf

    smf mamedev

    Joined:
    Apr 14, 2005
    Messages:
    1,255
    Likes Received:
    88
    It's also weird that TCP/IP is faster on the EE than on the IOP, but UDP is slower on the EE than on the IOP. The stack you are using could be contributing; it might not be making the most of the CPU caches, etc. My guess is that you could get it faster, but it's going to be a lot of work. The easiest way of finding out would be to not use a stack at all, but just repeatedly send out pre-canned packets. You won't be able to beat that no matter what code you rewrite (unless you figure out some magical change at that level).
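
    For example, something along these lines (a pseudo-C sketch; smap_send_frame() is a hypothetical stand-in for whatever raw transmit entry point the driver exposes):
    Code:
    /* Skip the protocol stack entirely and blast one pre-built UDP frame in a
     * loop, to see the raw rate the hardware path can sustain. */
    #include <stdint.h>
    #include <string.h>

    extern int smap_send_frame(const void *frame, int length);  /* hypothetical */

    /* Standard Internet checksum (RFC 1071) over the 20-byte IPv4 header. */
    static uint16_t ip_checksum(const uint8_t *p, int len)
    {
        uint32_t sum = 0;
        for (; len > 1; p += 2, len -= 2)
            sum += (uint32_t)((p[0] << 8) | p[1]);
        while (sum >> 16)
            sum = (sum & 0xFFFF) + (sum >> 16);
        return (uint16_t)~sum;
    }

    void blast_precanned_udp(void)
    {
        static uint8_t frame[14 + 20 + 8 + 1024];   /* Eth + IP + UDP + payload */
        uint8_t *ip = frame + 14, *udp = ip + 20;
        uint16_t csum;

        memset(frame, 0, sizeof(frame));
        memset(frame, 0xFF, 6);                     /* dst MAC: broadcast (test only) */
        /* frame[6..11]: source MAC goes here */
        frame[12] = 0x08; frame[13] = 0x00;         /* EtherType: IPv4 */

        ip[0] = 0x45;                               /* IPv4, 20-byte header */
        ip[2] = (20 + 8 + 1024) >> 8;               /* total length */
        ip[3] = (20 + 8 + 1024) & 0xFF;
        ip[8] = 64;                                 /* TTL */
        ip[9] = 17;                                 /* protocol: UDP */
        /* ip[12..19]: source and destination IP addresses go here */
        csum = ip_checksum(ip, 20);
        ip[10] = csum >> 8; ip[11] = csum & 0xFF;

        udp[0] = 0x30; udp[1] = 0x39;               /* src port 12345 */
        udp[2] = 0x30; udp[3] = 0x39;               /* dst port 12345 */
        udp[4] = (8 + 1024) >> 8;                   /* UDP length */
        udp[5] = (8 + 1024) & 0xFF;                 /* UDP checksum left at 0 (allowed) */

        for (;;)                                    /* send the same frame forever */
            smap_send_frame(frame, sizeof(frame));
    }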
     
  5. sp193

    sp193 Site Soldier

    Joined:
    Mar 28, 2012
    Messages:
    2,217
    Likes Received:
    1,052
    Actually this is the receiving performance. The PS2 sends nearly nothing at all.
    I believe that the 3.4MB/s cap is caused by the IOP, again. If the stack resides on the IOP, performance is probably better because the IOP then has one less interface to be interrupted by: the SIF.

    Someone once put the whole TCP/IP stack and SMAP driver on the EE, and wrote that he got about 4MB/s over TCP/IP. I also noticed no observable performance impact even when I deliberately made the EE re-copy the data from the SIF DMA buffer. Therefore, it's probably not the EE's fault.

    Yeah, I hate the IOP. It seems to be the one responsible for the problems in all of my projects. For example, my i.Link driver manages only 7MB/s with DMA support... as the IOP must byte-swap the incoming and outgoing data on its own. If the endianness is ignored (regardless of how wrong that is!), the hardware seems to be capable of pushing out around 50MB/s.
    For ATA support, I think that SONY once wrote that we can get about 20MB/s, despite the fact that all disks are set to use UDMA mode 4 (66MB/s) by default.
     
    Last edited: Sep 21, 2014
  6. smf

    smf mamedev

    Joined:
    Apr 14, 2005
    Messages:
    1,255
    Likes Received:
    88
    I'd have thought they would have just put byte swapping in the hardware; someone made a bad call on that one.
     
  7. Carlos96ps

    Carlos96ps Member

    Joined:
    Aug 18, 2014
    Messages:
    23
    Likes Received:
    0
    Sorry, but the PS2 has 100Mbit Ethernet hardware, right? The maximum speed should be 100Mbit/s, or roughly 10 megabytes per second.
     
  8. sp193

    sp193 Site Soldier

    Joined:
    Mar 28, 2012
    Messages:
    2,217
    Likes Received:
    1,052
    It's 100Mb/s, not 100MB/s. That's 100 megabits per second, not megabytes. It has 100-megabit Ethernet hardware, not gigabit.
    And 100Mbit is its bandwidth (link speed), not its throughput.

    And like I mentioned in my first post, I was disappointed by its lackluster performance. Like you probably hoped, I also once thought that it would be able to hit 8MB/s with DMA support... but I only managed to get about 2MB/s with TCP/IP.
     
  9. smf

    smf mamedev

    Joined:
    Apr 14, 2005
    Messages:
    1,255
    Likes Received:
    88
    Sorry, but no. The 100Mb/s is the maximum signalling speed; it is not a guarantee of throughput. You could attach a 100Mb/s interface to a C64, but it only has a 1MHz bus, so it's not going to achieve anywhere near the performance that you think it should. The bytes it sends will go out at 100Mb/s, but there would need to be many gaps between packets while it assembles the next one.

    A lot of NAS units have gigabit ports, but their throughput is closer to 100Mb/s. If a unit can maintain 200Mb/s, then the gigabit port is still worth having. It's very unlikely that you'll ever see a connection that is completely saturated, and you definitely won't if one end is a PlayStation 2.
     
  10. Carlos96ps

    Carlos96ps Member

    Joined:
    Aug 18, 2014
    Messages:
    23
    Likes Received:
    0
    10Mbit full/half duplex ≈ 1MB/sec
    100Mbit full/half duplex ≈ 10MB/sec
    1Gbit full/half duplex ≈ 100MB/sec
     
  11. ps2netbox

    ps2netbox Spirited Member

    Joined:
    Dec 26, 2017
    Messages:
    116
    Likes Received:
    93
    Formula:
    speed = Window_Size / round-trip time
    After I finished PS2Netbox, I realised that the PS2's TCP performance can be increased by using a larger window size. 5K is too small, since the EMAC3 has a 16K receive buffer ;)
    Code:
    #ifdef INGAME_DRIVER
    #define TCP_WND 5120
    #else
    #define TCP_WND 32768
    #endif
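
    Just to put rough numbers on that formula (a back-of-the-envelope sketch; the 2ms round-trip time is only an assumed figure):
    Code:
    /* Upper bound from speed = Window_Size / round-trip time. */
    #include <stdio.h>

    int main(void)
    {
        const double rtt = 0.002;   /* assumed round-trip time: 2ms */

        /* ~2.4MB/s with the 5120-byte in-game window... */
        printf("TCP_WND  5120: <= %.1f MB/s\n",  5120.0 / rtt / (1024.0 * 1024.0));
        /* ...versus ~15.6MB/s with 32768, where the window stops being the limit. */
        printf("TCP_WND 32768: <= %.1f MB/s\n", 32768.0 / rtt / (1024.0 * 1024.0));
        return 0;
    }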
    
     
  12. sp193

    sp193 Site Soldier

    Joined:
    Mar 28, 2012
    Messages:
    2,217
    Likes Received:
    1,052
    This thread is now obsolete. Thanks to a suggestion by Maximus32, I changed the queuing system within NETMAN and implemented a multi-threaded system within HDLGameInstaller (to write to the HDD in the background). I observed 5MB/s over TCP within HDLGameInstaller, from the EE. The raw throughput might actually be higher.

    With the upgrade from LWIP v1.4.1 to v2.0.0, speeds have generally improved too, as LWIP v2.0.0 was designed to avoid message-passing.
    I established that the message-passing caused a rather severe performance penalty on the IOP: http://psx-scene.com/forums/f19/high-cpu-usage-ps2sdk-lwip-ports-ps2ip-156529/

    That is for OPL, isn't it? It is set low because we do not have enough memory, so there are very few PBUFs.
    If the IOP is held up and the buffers are full, then there will be packet drops.

    For the LWIP v2.0.0 port, I set the window size in the PS2SDK to 32768 for the IOP and 65535 for the EE.
     
  13. ps2netbox

    ps2netbox Spirited Member

    Joined:
    Dec 26, 2017
    Messages:
    116
    Likes Received:
    93
    Yes, it is for OPL.
    But 5K is really small, since the EMAC3 has a 16K buffer. I mean, even if zero IOP buffering is used, the receive window could still be 16K.

    I think network performance can be increased by (e.g. in HDLGameInstaller):
    1) not using the interrupt. Interrupts need a lot of CPU; just poll every few milliseconds instead.
    2) DMAing from the SMAP and DMAing to the IDE (i.e. the IDE data must be aligned).
    3) ignoring any data checksum, since Ethernet already uses CRC32 (a sketch of this one follows below).
    4) rewriting the TCP/UDP stack as a simpler version.
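
    For point 3, a rough sketch of what this could look like in lwipopts.h, assuming lwIP's standard checksum-control options (and that you trust the Ethernet CRC32 to catch corruption on the local segment):
    Code:
    /* Sketch only: skip software verification of incoming checksums and rely
     * on the Ethernet FCS (CRC32).  Outgoing checksums are still generated,
     * so that the other side keeps accepting our packets. */
    #define CHECKSUM_CHECK_IP   0
    #define CHECKSUM_CHECK_UDP  0
    #define CHECKSUM_CHECK_TCP  0

    #define CHECKSUM_GEN_IP     1
    #define CHECKSUM_GEN_UDP    1
    #define CHECKSUM_GEN_TCP    1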
     
  14. sp193

    sp193 Site Soldier

    Joined:
    Mar 28, 2012
    Messages:
    2,217
    Likes Received:
    1,052
    Hmm. That makes sense; as long as nobody sends more than the Rx FIFO size, it is okay. Thanks.

    For our driver, we do use the interrupt. But once it has asserted, we disable the interrupt and wake up a handler thread. The thread will then poll until there are no more frames to process:
    SMAP Interrupt handler: https://github.com/ps2dev/ps2sdk/blob/master/iop/network/smap/src/smap.c#L503
    Interrupt-handling part of SMAP event thread: https://github.com/ps2dev/ps2sdk/blob/master/iop/network/smap/src/smap.c#L441
    Rx-handling code that runs until the next BD is empty: https://github.com/ps2dev/ps2sdk/blob/master/iop/network/smap/src/xfer.c#L72

    This is how the HDLDump servers were designed to work. But the IOP is too weak to handle fast transfers with TCP/IP. :(
    HDLDump v0.9.0 uses UDP for everything, and uses its own protocol stack (PKTDRV). But the problem is that it uses UDP.

    I believe it is because of thread-switching too. I was told that if I handled the RXEND event entirely within the interrupt handler, up to 5.9MB/s could be reached with HDLDump v0.9.0 (hence the custom HDLDump v0.9.3 server).
    But that also means that the other threads will not run as often, so there is a performance loss elsewhere on the IOP... :|

    We do try to keep the buffers aligned. A long time ago, we used to have code for handling misaligned frames.
    At some point, this bug was fixed in LWIP. For OPL, I patched SMSTCPIP some time ago.


    My data format has no checksum of its own, because I use TCP.
    Unfortunately, the SMAP does not seem to support hardware-based CRC32 computation, so CRC32 calculation still has to be done in software. :|

    ***

    Maximus32 got very good Ethernet speeds (6-8MB/s, I believe), but he just did DMA from the interrupt, with SIFCMD. His DEV9 code does not wait for DMA transfer completion, so it involves modifying DEV9.
    I think it works well in his case because he is working on PS2Linux and doesn't need to cater to IOP-side threads (all processing is done on the EE for PS2Linux). On the other hand, I made NETMAN to offer the choice of supporting either an EE-side or an IOP-side stack.

    If avoiding threads is really the only way to get 6+MB/s, then maybe I have no other choice. But before that, I really want to try to derive a system that can still support threads.
    We are already surpassing SONY's performance: they only got 28Mbps, while we're getting about 40Mbps.

    I was considering modifying NETMAN again, so that it avoids using SIFRPC as much as possible. But modifying and debugging the network functionality was very expensive for me (it needs a lot of effort and time), so I decided to leave things as they are. :(

    Right now, NETMAN uses a ring buffer. The number and lengths of all frames in the whole ring buffer are kept in PacketReqs. A copy of PacketReqs is always passed to the EE at SifCallRpc. This allows the IOP to keep updating PacketReqs, as the EE will have its own copy.

    When the EE completes handling all the frames, it indicates to the IOP how many frames were processed, so the IOP knows how many more frames can be sent over. However, this means that the IOP can only send more frames to the EE once the EE completes processing them and the RPC call completes.

    While the IOP can only inform the EE of new frames via SifCallRpc, the frames themselves can be transferred over at any time with DMA. This saves some time.

    So what I planned to do next is to put a size word before each frame slot (size, padding bytes, and then the frame data), which would allow the IOP to directly update the size field on the EE with DMA. If the size field is zero, then the EE & IOP agree that the frame slot is not in use. There will be some wasted RAM (the padding bytes are needed because of DMA alignment), but this might improve performance; the slot layout is sketched below.

    This design should allow the IOP to send new frames to the EE via DMA transfer, without SifCallRpc, as long as the IOP knows that the EE is still processing frames (it can check the RPC status, or even just poll the completion semaphore with PollSema). If not, then the IOP can "wake up" the EE with SifCallRpc.

    This same design will also be implemented for transmission (EE -> IOP).
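
    As a rough illustration only, the slot layout I have in mind would look something like this (the names, the 64-byte figure and the attribute usage are just placeholders, not the actual NETMAN code):
    Code:
    /* Hypothetical EE-side slot layout.  The IOP DMAs the frame data first and
     * the size word last; a non-zero size marks the slot as ready, and the EE
     * writes 0 back once it has consumed the frame. */
    #include <stdint.h>

    #define SLOT_ALIGN     64     /* assumed: EE cache line / DMA alignment */
    #define MAX_FRAME_SIZE 1536   /* 1514 rounded up to a multiple of 64 */

    typedef struct frame_slot {
        volatile uint32_t size;                       /* 0 = slot not in use */
        uint8_t pad[SLOT_ALIGN - sizeof(uint32_t)];   /* wasted on purpose */
        uint8_t data[MAX_FRAME_SIZE];                 /* the frame itself */
    } __attribute__((aligned(SLOT_ALIGN))) frame_slot_t;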

    However, I am reluctant to try again, because every time I change NETMAN's design, it breaks and is very difficult to fix! I do not know for sure whether it will really improve performance, but I think it has a chance...

    I also planned to implement some benchmarking system, to determine where the holdup in the system is. But I have not done it, because I got tired.
    I think the T15000 Performance Analyzer is the official way to benchmark, but mine is a normal T10000H. :(

    If you have any better suggestions that can fit my goals, I will be very grateful.
    I am also grateful for Maximus32's suggestions and help too, but I think I want to try other solutions first.
     
    Last edited: Jan 23, 2018
  15. ps2netbox

    ps2netbox Spirited Member

    Joined:
    Dec 26, 2017
    Messages:
    116
    Likes Received:
    93
    Given speed = Window_Size / round-trip time, this is very important.
    When transferring over the network, one must send as much data as possible before waiting for an ACK.

    This may need more CPU than polling.
    (Since we have a 16K window size and an expected speed of 10MB/s, polling every 1ms is enough.)
    This must be wrong. CRC32 computation is a basic function of the MAC.

    An Ethernet frame looks like this:
    destination address (6 bytes), source address (6 bytes), type (2 bytes), Ethernet payload, CRC32 (4 bytes).
    The user should never have to care about those last 4 bytes of CRC32.
    The Ethernet payload looks like this:
    IP header (20 bytes or more), TCP/UDP header, TCP/UDP payload.

    The IP header has a checksum, the TCP/UDP headers do too, and the TCP checksum also covers the whole payload.
    So validating the TCP checksum needs a lot of CPU, since you must read all of the data.
    This is the main reason why TCP is so much slower than UDP.
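
    To illustrate why that verification is expensive: a minimal sketch of the standard Internet checksum (RFC 1071) that IP/TCP/UDP use. It has to touch every payload byte once, which is a lot of extra memory reads for a ~37MHz IOP:
    Code:
    #include <stddef.h>
    #include <stdint.h>

    /* Minimal RFC 1071 ones'-complement checksum, as used by IP/TCP/UDP.
     * Every byte of the buffer is read once; that is where the CPU time goes. */
    uint16_t inet_checksum(const void *buf, size_t len)
    {
        const uint8_t *p = buf;
        uint32_t sum = 0;

        for (; len > 1; p += 2, len -= 2)
            sum += (uint32_t)((p[0] << 8) | p[1]);   /* 16-bit big-endian words */
        if (len)
            sum += (uint32_t)(p[0] << 8);            /* odd trailing byte */

        while (sum >> 16)                            /* fold the carries back in */
            sum = (sum & 0xFFFF) + (sum >> 16);

        return (uint16_t)~sum;
    }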

    Consider that:
    1. TCP/IP stack on IOP: ~2.3MB/s
    2. UDP stack on IOP (with the interrupt stealing hack): ~5.1MB/s

    It seems the TCP checksum alone needs about half of the total CPU time.
    And if the above tests included a data copy (e.g. first DMA from the SMAP into LWIP's pbuf, then a copy into the user buffer), then without that copy the speed could reach 8MB/s or even higher.

    I think one could even get >10MB/s on the PS2.
     
    wisi likes this.
  16. sp193

    sp193 Site Soldier

    Joined:
    Mar 28, 2012
    Messages:
    2,217
    Likes Received:
    1,052
    Why would it need more CPU? Shouldn't polling worsen the performance of the CPU in general?

    I was referring to the checksum offload functionality, for IPv4. I guess, yes... the MAC does compute the CRC32 itself. Thanks for the information.

    Now I wonder: if all MACs can do this, then why don't they all have IPv4 checksum offload? :|
    That is one of the things that I don't think the SMAP has. Or at least, I don't know how to enable it.

    It is possible that it takes up a lot of CPU time, but the two tests I mentioned were not only done with different protocols; the whole protocol stack was changed as well.

    TCP/IP was done with LWIP, while UDP was done with PKTDRV. That comparison is flawed if you want to compare the performance of the transport-layer protocols themselves. This is one of the things that I did not do well.

    I forgot what version of LWIP was used, but it might have been v1.4.0. Up to LWIP v1.4.1, there used to be a main TCP/IP thread, to which we passed incoming frames through a message box. I have done some tests and concluded that this part can result in very low performance, which is avoided in SMSTCPIP (its author avoided entering kernel functions like WaitSema if there are already messages waiting).
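
    A sketch of that idea (not the actual SMSTCPIP code): keep a message counter that the receive thread can test without a kernel wait, and only fall back to WaitSema when the queue is really empty. The queue handling itself is elided; enqueue/dequeue and the counter updates would also have to run with interrupts suspended, and the semaphore is assumed to be created elsewhere with CreateSema.
    Code:
    #include <intrman.h>   /* CpuSuspendIntr / CpuResumeIntr */
    #include <thsemap.h>   /* WaitSema / iSignalSema */

    static volatile int msg_count;   /* frames queued for the stack thread */
    static volatile int waiting;     /* stack thread is blocked in WaitSema */
    static int sema;                 /* semaphore used only for blocking */

    void stack_thread_wait_for_frame(void)
    {
        int state;

        CpuSuspendIntr(&state);
        if (msg_count == 0) {
            waiting = 1;
            CpuResumeIntr(state);
            WaitSema(sema);          /* slow path: nothing queued, so block */
        } else {
            CpuResumeIntr(state);    /* fast path: skip the kernel wait */
        }
        /* ...dequeue one frame and decrement msg_count... */
    }

    /* Called from the SMAP interrupt after queueing a frame. */
    void frame_queued_from_interrupt(void)
    {
        msg_count++;
        if (waiting) {
            waiting = 0;
            iSignalSema(sema);       /* wake the thread only if it blocked */
        }
    }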

    LWIP v2.0.0 no longer does message-passing for incoming frames, so this performance penalty has been relieved. Even so, it is still quite bad; even if one disables the checksum verification, it is still quite bad.
    PKTDRV doesn't have this message-passing part, from what I remember. It also lacks a lot of other things, like the sockets layer. I never did a comparison of LWIP (raw API) against PKTDRV.

    But we cannot really avoid this, unless we use the low-level LWIP APIs. And without the sockets layer, the software must be written specifically against LWIP.
    So my solution was to just put the protocol stack on the EE.

    If LWIP is on the EE and NETMAN is used to interface LWIP with the SMAP, there is no copying of data on the IOP side (only DMA transfers). However, it is still substandard in performance, if you say that we can get more than 10MB/s...
    Particularly in terms of transmission, since I could only get around 2+MB/s with it.

    If you can achieve that, it will be very good. I've already given up on making this thing work well, since there are so many factors to consider. :|

    Thank you for your information.
     
    Last edited: Jan 24, 2018
    wisi likes this.
  17. ps2netbox

    ps2netbox Spirited Member

    Joined:
    Dec 26, 2017
    Messages:
    116
    Likes Received:
    93
    I think high speed cannot be reached by reusing the existing IRX modules.
    One must write a custom IRX (copying code from DEV9, SMAP and ATAD).
    This could be used by HDLDump, but not by OPL.
     
  18. sp193

    sp193 Site Soldier

    Joined:
    Mar 28, 2012
    Messages:
    2,217
    Likes Received:
    1,052
    Yeah. :(

    But thanks for reading my long story and for listening to me complain about how the IOP is horrible.
    I started work on this project in 2011 or so. There were so many problems, but not all of them were related to networking; the EE kernel syscalls had bugs, the homebrew SIFRPC library was bugged, there was no network manager yet, and then there were some threading problems caused by the LWIP code not being 100% compatible with the PS2 kernel design (i.e. you cannot SignalSema when interrupts are disabled). So that's why I got tired. :D

    I will make a pull request for OPL, to adjust the TCP window size. Thanks.
     
    AKuHAK and wisi like this.
  19. ps2netbox

    ps2netbox Spirited Member

    Joined:
    Dec 26, 2017
    Messages:
    116
    Likes Received:
    93
    speed = Window_Size / round-trip time
    If the expected speed is 8MB/s:
    effective window = 16K × 1024 / (128 bytes header + 1024 bytes data) ≈ 14K
    so we get: round-trip time <= 14K / 8MB/s ≈ 1.7ms.
    So I think a 1ms polling interval is enough, but 2ms is too long.
    Since the SMAP cannot do bus mastering (i.e. only the IOP can start DMA), the receive window cannot be larger than that ~14K without using the interrupt.

    This scheme would only use 1024 + 128 bytes of IOP memory.

    Code:
    /* Polling receive loop (pseudo-C): wake up every ~1ms, drain the SMAP
     * FIFO in 128-byte-header + 1024-byte-data chunks, and DMA each chunk
     * straight to the HDD.  The function names are placeholders. */
    while (wait_alarm()) {                /* fires once per polling interval */
        while (have_recv_data()) {
            dma_from_smap(header, 128);   /* pull the protocol header */
            check_header(header);
            dma_from_smap(data, 1024);    /* pull the payload */
            dma_to_hdd(data, 1024);       /* write it out (aligned) */
        }
    }
     
  20. ps2netbox

    ps2netbox Spirited Member

    Joined:
    Dec 26, 2017
    Messages:
    116
    Likes Received:
    93
    This is bad news for me ;)
    When I sell the PS2NETBOX, someone will say that the official OPL can already do this with SMB ;)
     
    wisi likes this.