  #204  
Old 03-13-2023, 02:37 PM
Zukan
Kobold
Join Date: Jul 2010
Posts: 177

Hey guys, here are some of the takeaways from the internal friends and family stress tests we did over the weekend, as we prepare for the public alpha, which will hopefully involve a lot more people. There's a bunch of techno-babble that goes over my head, but the short of it is our tests were really successful! Things held up, and for the stuff that had some issues we've got solutions or have already resolved them.

Quote:
Hey all! Over the weekend we held two Stress Tests with our Friends and Family tester community. Ali wrote up a very nice breakdown in our Slack. There was no overly sensitive information, so I thought it'd be fun to share it with you all! Enjoy. (Apologies for the incoming wall of text crit!)
[8:37 AM]
Hey y'all! So some interesting issues, observations and metrics from our load tests.

We got around 80 clients both days.

I will break down these issues into their own threads below.

On the first day these issues were noticed:
1. The immediately noticeable micro stutter that happens approx every 20 seconds. The timing made sense as we have an expensive Core.Models.ClientModel.Save() happening on that timer as well.

2. When massive AoE hits clients, it causes the whole server to stop while it processes what happened.

3. Players that are far away and are supposed to exit your sphere of influence sometimes hung around indefinitely.

On the second day these issues were noticed:
1. There was visible latency on positional updates when a lot of players were close around you, but this improved as you moved away. This was consistent with my mass bot tests last year, and we have logic that needs to be fine-tuned to counteract this.

2. Someone had made a corpse art piece at the entrance. It was great! Showed us that corpses add unreasonable load.

3. The errors: 7B2EE78E and 6D24DB45 were observed.

General Issues unrelated to Network Test:
1. The model fail errors were noticeable (nameplate visible but model invisible). These may be solved with the new texture and attachment systems, but that will have to be tested when FnF is updated to latest master.

2. Group member UI bugs. When you zone or die the entire group UI becomes unusable.

Observations:
- As expected, NPCs add a lot more load on the server than Clients. There was virtually no increase in load between <10 players and >70 players.

- The network library held up really well! I always had a nagging fear that with more clients we would see packet queue issues. However, there were none that I could see.

- That said, we need better network profiling tools!

- There is a lot of optimization required in client saves and positional updates especially when we have a lot of users in one area.

Metrics:
- Night Harbor (our heaviest zone BY FAR) uses about 2.4% of server memory (~400MB), runs 41 threads, and sits at roughly 75% CPU load (out of an 800% max; 8 cores at approx 3GHz each). This is at full load with multiple dragons. When there wasn't much going on it hovered around 55%. The main driver of this load is navigation and the very large navmesh that NH has.

- Save stuttering was noticed when we got to over 40 clients.

- Positional update stuttering was noticed when more than 40 clients and tons of corpses were at the gate and you were within 100m of it.

Overall:
It was a great success! I think for our first load test, this was amazing and we got a lot of good metrics and I think I have a clear plan for us to solve the above.

Considering nothing crashed, nothing ran out of memory, the zone did not break in any way, and all the issues we found are things that can be fixed with minor optimizations, we are in a really good place. I look forward to fixing them and then testing with hundreds in the same zone!

Microstutters:
I have already solved this. I have converted the save algorithm to a scheduled queue, where we queue the saves on a timer then handle saving over multiple frames and use transactions as much as possible.

Further optimizations: All client saves should be done via queue, even for things that require immediate save, unless we specifically say we need this INSTANTLY (such as trades or taking a port).
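The queue-with-instant-bypass idea above could look something like this minimal Python sketch. All names (SaveQueue, request_save, tick) are hypothetical illustrations, not the real API:

```python
from collections import deque

class SaveQueue:
    """Sketch of a scheduled save queue: saves are queued on request and
    drained a few per frame, with an instant path for critical saves."""

    def __init__(self, saves_per_frame=4):
        self.pending = deque()
        self.queued_ids = set()  # dedupe: a client needs at most one pending save
        self.saves_per_frame = saves_per_frame

    def request_save(self, client, instant=False):
        # Trades or taking a port bypass the queue and save INSTANTLY
        if instant:
            self._save(client)
        elif client.id not in self.queued_ids:
            self.queued_ids.add(client.id)
            self.pending.append(client)

    def tick(self):
        # Called once per server frame: drain a small batch so one big
        # save timer no longer stalls a single frame
        for _ in range(min(self.saves_per_frame, len(self.pending))):
            client = self.pending.popleft()
            self.queued_ids.discard(client.id)
            self._save(client)

    def _save(self, client):
        client.saved = True  # stand-in for the real transactional DB write
```

The dedupe set keeps a chatty client from flooding the queue; the per-frame batch size would be a tuning knob.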

We need to have live database performance profiling that we can pull from a running server (maybe by enabling it with a gm command, then extracting the results and also disabling it).
[8:37 AM]
AoE:
We need to look into this one more. There are two things at play here:

1. The client stutters anyways when a lot of VFX is spawned at the same time. We can likely queue these and spawn them over 4-10 frames without it being noticeable at all to the user.

2. I think when the damage, deaths, and buffs are applied, they all immediately save, causing an 80 client mass save in one go destroying our database pipe lol.
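Point 1 above (spawning VFX over 4-10 frames) amounts to a small per-frame budget. A hedged sketch, with all names and the budget value being assumptions:

```python
from collections import deque

class VfxSpawnQueue:
    """Sketch of spreading AoE VFX spawns over several frames instead of
    spawning them all in the same frame."""

    def __init__(self, spawns_per_frame=8):
        self.pending = deque()
        self.spawns_per_frame = spawns_per_frame

    def enqueue(self, effects):
        # A big AoE queues all of its effects in one call...
        self.pending.extend(effects)

    def tick(self, spawn_fn):
        # ...but only a handful are actually spawned per frame, so an
        # 80-target AoE is amortized over several frames
        for _ in range(min(self.spawns_per_frame, len(self.pending))):
            spawn_fn(self.pending.popleft())
```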

Players not exiting SOI:
We need to audit the code that defines this. Maybe there's a logic issue there.

Positional update delays when many entities (and especially many clients) exist within a SOI:

This is in effect not too difficult to solve, but explaining it will be tough because it relies on knowledge of both our SOI system and the techniques we could move to.

The way SOI works right now for us is that we check all entities against all other entities to see if we need to enter or exit them from any SOI. The logic goes:

for entity a of entities:
    for entity b of entities:
        var shouldBeInSOI = b.shouldBeInSOIof(a)

        if (not inSOI and shouldBeInSOI) then run entryCallbacks
        if (inSOI and not shouldBeInSOI) then run exitCallbacks

Then separately we run:

for client a of clients:
    var positionUpdates
    for entity b of a.sphereOfInfluence:
        positionUpdates.Append(b.position)
    a.send(positionUpdates)

This has a couple of pain points:
1. Building packets is relatively expensive.
2. This type of packet accounts for over 90% of all our traffic.
3. Load grows quadratically as more clients enter each other's SOI.
4. Each client must build its own positionUpdates.

Possible solution:
A client's SOI has many variables (such as being a pet of a client, a party member, or in combat), but a big portion of what defines SOI is distance. This can be substituted with a grid distance. An octree would be especially useful here, but I see that it may be simpler and ultimately easier to implement a regular 3D grid.

So instead of checking if an entity is X meters away from another entity, we check if the GRIDPOSITION is within X GRIDDISTANCE from another entity.
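The GRIDPOSITION/GRIDDISTANCE check could be sketched like this in Python; the cell size is an illustrative assumption, and a Chebyshev (max-axis) cell distance stands in for the "X meters away" check:

```python
import math

CELL_SIZE = 25.0  # meters per cell; a tuning assumption, not from the post

def grid_position(pos):
    """Map a world-space (x, y, z) position to an integer grid cell."""
    return tuple(math.floor(c / CELL_SIZE) for c in pos)

def within_grid_distance(cell_a, cell_b, grid_distance):
    """Chebyshev distance between cells: the cheap stand-in for the
    per-entity radius check."""
    return max(abs(a - b) for a, b in zip(cell_a, cell_b)) <= grid_distance

def bucket_entities(entities):
    """Group (id, position) pairs by cell so SOI checks only touch
    nearby buckets instead of every entity pair."""
    buckets = {}
    for entity_id, pos in entities:
        buckets.setdefault(grid_position(pos), []).append(entity_id)
    return buckets
```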

What this ultimately allows us to do when building packets is to build one positional update PER GRIDPOSITION, append the relevant ones by directly serializing them (as we currently do for positionUpdates, skipping the garbage collection), and then finally append any additional required out-of-distance SOI entities.

This would be much, much faster, by orders of magnitude in my estimation, since we no longer need to do positional reads per entity squared.

We may also find that it's more suitable to send this all in two packets: one that is faster, using only grid position, and another that is slower, for entities in SOI but out of grid distance.
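The per-GRIDPOSITION payload reuse could be sketched as below. This is an assumption-laden illustration: `serialize` is a placeholder for the real packet writer, and the cell-distance check mirrors the grid idea above:

```python
def build_cell_payloads(buckets, serialize):
    """Serialize each cell's position updates once; every client whose
    SOI covers that cell reuses the same bytes."""
    return {cell: serialize(entities) for cell, entities in buckets.items()}

def payload_for_client(cell_payloads, client_cell, grid_distance):
    """Concatenate the pre-built payloads of all cells in range, instead
    of rebuilding per-entity updates for every client."""
    parts = []
    for cell, payload in sorted(cell_payloads.items()):
        if max(abs(a - b) for a, b in zip(cell, client_cell)) <= grid_distance:
            parts.append(payload)
    return b"".join(parts)
```

The serialization cost then scales with the number of occupied cells rather than clients times entities.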

Removal of existing optimizations:
We currently have an optimization where, if we exceed a certain number of clients within SOI, we start throttling update packets. I believe this is the main thing being observed with the slower interpolation, especially when clients are jumping around.

Other factors:
Implementing parabolic trajectory code and using it for jumping will smooth out the clipping into the ground we noticed when clients jump around in high-latency or throttled position update situations.
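The parabolic trajectory itself is just the standard kinematics formula; given the jump's start state, any peer can evaluate the height at time t rather than extrapolating from stale packets. The gravity constant here is a placeholder, not the game's value:

```python
GRAVITY = -9.8  # m/s^2; illustrative placeholder

def jump_height(start_height, vertical_velocity, t):
    """Closed-form parabola: h(t) = h0 + v*t + 0.5*g*t^2, so a late or
    throttled position update no longer makes the client clip into the
    ground mid-jump."""
    return start_height + vertical_velocity * t + 0.5 * GRAVITY * t * t
```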

We should also let the client know, and allow it to adjust its expectation of the update frequency for smoother interpolation. Currently the client still interpolates using its original assumptions, which can cause jerkiness.
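One common way to implement that expectation adjustment: the server announces its current update interval and the client buffers slightly more than one interval, so it always has two snapshots to interpolate between. The 1.5x factor and jitter margin below are illustrative assumptions, not values from the post:

```python
def interpolation_delay_ms(update_interval_ms, jitter_margin_ms=20.0):
    """Render the world slightly in the past: buffer a bit more than one
    server update interval so interpolation never runs out of snapshots
    when the server throttles its send rate."""
    return update_interval_ms * 1.5 + jitter_margin_ms
```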