|
horati
|
 |
« on: 13. December 2009, 02:35:03 PM » |
|
I am not positive I understand the exact intent of this code so I'll post here for Marvin to review. In UpdatingThread.TimingMode, I found a bunch of switch( this ) statements. Shouldn't these use math and polymorphism for greater efficiency and smaller code size? It is obviously a very tiny optimization.
|
|
|
|
|
Logged
|
|
|
|
Marvin Fröhlich
Xith Lord
Administrator
Guru
   
Offline
Posts: 4337
May the 4th, be with you...
|
 |
« Reply #1 on: 13. December 2009, 09:01:51 PM » |
|
How would that look like? I don't see a way to optimize this piece of code.
Marvin
|
|
|
|
|
Logged
|
|
|
|
|
horati
|
 |
« Reply #2 on: 13. December 2009, 11:43:41 PM » |
|
Enumerations are just syntactic sugar for classes. They are allowed to carry full state with them. Normally, this can be dangerous but was put into the language for cases just like this. switch( this ) { case foo: return a / 1000; case bar: return a / 1000000; case blah: return a / 1; default: throw new Exception(); }
can be replaced with return a / this.divisor;
I noticed that you already initialized a bunch of enumeration state. I didn't trace through all the logic to see exactly what variables were used where, but it should be easy to replace with code very similar to above.
|
|
|
|
|
Logged
|
|
|
|
Marvin Fröhlich
Xith Lord
Administrator
Guru
   
Offline
Posts: 4337
May the 4th, be with you...
|
 |
« Reply #3 on: 13. December 2009, 11:48:50 PM » |
|
Well, I thought, a switch should be cheaper than a division. This is the reason, why I did it this way. So for the case, where only the plain value is returned, we gain a little performance and for the others it shouldn't matter too much.
Do you still think, it should be done as you suggested?
Marvin
|
|
|
|
|
Logged
|
|
|
|
|
horati
|
 |
« Reply #4 on: 14. December 2009, 12:20:02 AM » |
|
Well, I thought, a switch should be cheaper than a division. This is the reason, why I did it this way. So for the case, where only the plain value is returned, we gain a little performance and for the others it shouldn't matter too much.
Do you still think, it should be done as you suggested?
Actually, I do. Taken individually, the division instruction is more expensive than the branch instruction on old hardware. I'm not actually sure on the modern CPUs. The problem with branches is that they hurt the instruction pipeline. When the instructions are single path, most modern CPUs can execute up to 17 instructions per cycle; however, they only have 4 pipelines so they can only follow 4 branch possibilities up to 17 instructions deep. In normal code which has lots of branches, instructions beyond 4-9 are usually reversed because an unanticipated branch is taken instead. With inlining of simple methods plus pipelining, I believe that you will gain much more from simplifying those types of methods. P.S. The exact numbers I used are from memory. I would have to go back to various Intel and AMD whitepapers that I have not read in a while to confirm exactly how many instructions and pipelines per CPU model. Even if my memory failed on the numbers, the reasoning is valid.
|
|
|
|
|
Logged
|
|
|
|
|
horati
|
 |
« Reply #5 on: 14. December 2009, 12:37:00 AM » |
|
One more thing. Except in the most critical inner loops, reduced code size is at least as important as inlining.
Bytes in registers are read/written in 1 cycle. Bytes in L1 cache are read/written in 2 cycles. L1 cache lines are 64 bytes on both Intel and AMD processors. Typical read ahead optimization reads 2 lines on AMD and 4 lines on Intel. (The difference is because access to outer caches is less expensive on AMD so they don't prefetch as much.)
Bytes moved from L2 to L1 cause a wait of approximately 18 cycles. Moving from main memory to L2 is drastically more expensive and varies substantially between AMD and Intel. Luckily, prefetching algorithms cause these waits to occur very infrequently unless you have extremely little locality of reference.
Java makes it difficult to predict locality of reference although it is possible in some situations.
Edit:
Even 2 cycles for L1 reads/writes might be outdated. The modern processors probably do this in 1 cycle. These kinds of details tend to change very frequently. The larger point is the drastic difference between RAM, L2, and L1.
|
|
|
|
« Last Edit: 14. December 2009, 12:42:32 AM by horati »
|
Logged
|
|
|
|
Marvin Fröhlich
Xith Lord
Administrator
Guru
   
Offline
Posts: 4337
May the 4th, be with you...
|
 |
« Reply #6 on: 14. December 2009, 01:20:56 PM » |
|
Ok, you got me. I have changed that in UpdatingThread.TimingMode. Please feel free to verify it.
Marvin
|
|
|
|
|
Logged
|
|
|
|
|
horati
|
 |
« Reply #7 on: 14. December 2009, 02:16:02 PM » |
|
I'm glad I was able to sway you. Obviously, anything we do for that one enumeration is completely trivial.
The kind of reasoning I outlined was used to make MonetDB/X100 an order of magnitude faster than MonetDB. Obviously, they had much more control than we do because they did EVERYTHING in highly optimized (aka illegible) C with lots of testing of every loop and function on several processor platforms.
Depending how widespread attempts to speed up computation using logic branches are, we could easily see a 50-200% speed boost of the code that actually runs within the JVM by reducing the code size and simplifying the instruction pipeline. I have no idea how much the framerate is determined by code running within the JVM versus the graphics card.
|
|
|
|
|
Logged
|
|
|
|
Marvin Fröhlich
Xith Lord
Administrator
Guru
   
Offline
Posts: 4337
May the 4th, be with you...
|
 |
« Reply #8 on: 14. December 2009, 03:30:05 PM » |
|
I'm glad I was able to sway you.
Well, you seemed to have digged into it pretty deeply. So you just had to be right  . Obviously, anything we do for that one enumeration is completely trivial.
Sure. Tbh. I don't like my whole TimingMode approach anymore since a long time. I always wanted to change it to perform even better and to get a cleaner and better to understandable API by predividing the nanotime to micors and millis and even fractions of seconds as float and simply passing these four values to the update methods. This way the computations would be done once and only once per frame, which is completely acceptable and wouldn't mean to divide the value for each use case. But I already had passed the line, where I wouldn't want to introduce not-so-necessary API changes and so I delayed it for v2.0. I have no idea how much the framerate is determined by code running within the JVM versus the graphics card.
Well, that depends of how much DisplayLists, VBOs and shader programs you use. So it could be 99% or 5% or whatever. Marvin
|
|
|
|
|
Logged
|
|
|
|
|
horati
|
 |
« Reply #9 on: 14. December 2009, 09:15:12 PM » |
|
Well, you seemed to have digged into it pretty deeply. So you just had to be right  . Actually, that is true. And now that I am thinking about it: my original posted stated that L2 fetches are ~18 cycles. That is for Intel CPUs. IIRC, they are ~8 cycles on AMD. That is the primary reason why AMD can achieve similar performance numbers from slower CPUs. Tbh. I don't like my whole TimingMode approach anymore since a long time. I always wanted to change it to perform even better and to get a cleaner and better to understandable API by predividing the nanotime to micors and millis and even fractions of seconds as float and simply passing these four values to the update methods. This way the computations would be done once and only once per frame, which is completely acceptable and wouldn't mean to divide the value for each use case. But I already had passed the line, where I wouldn't want to introduce not-so-necessary API changes and so I delayed it for v2.0.
How many levels of animate calls are there? If it is only 2 or 3 levels deep, that makes sense. Obviously, calculating it once makes sense regardless. I am just thinking that there are some other approaches if the call hierarchy gets really deep. Well, that depends of how much DisplayLists, VBOs and shader programs you use. So it could be 99% or 5% or whatever.
Is it fair to say that a "typical" application might run 50% of its rendering via Xith code running inside the JVM? If so, a 50+% throughput increase in the JVM code is significant. If a "typical" application runs less than 20% of its code inside the JVM, then the boost is interesting but less impressive.
|
|
|
|
|
Logged
|
|
|
|
Marvin Fröhlich
Xith Lord
Administrator
Guru
   
Offline
Posts: 4337
May the 4th, be with you...
|
 |
« Reply #10 on: 15. December 2009, 08:37:55 PM » |
|
How many levels of animate calls are there? If it is only 2 or 3 levels deep, that makes sense. Obviously, calculating it once makes sense regardless. I am just thinking that there are some other approaches if the call hierarchy gets really deep.
there should be only a few (1 or 2 I guess), because all the Animatables, Updatables, ScheduledOperations, etc. are added to the RenderLoop/Updater/OperationScheduler directly. I guess, you're referring to the stack size and big signature problems. This could be circumvented by only passing some kind of a GameTime class instance, which is fed with precomputed nanos, micors, millis and seconds and that provides getters for these values. Is it fair to say that a "typical" application might run 50% of its rendering via Xith code running inside the JVM? If so, a 50+% throughput increase in the JVM code is significant. If a "typical" application runs less than 20% of its code inside the JVM, then the boost is interesting but less impressive.
I really cannot tell. It depends on the way, Xith is used. I don't know about any statistics. Marvin
|
|
|
|
|
Logged
|
|
|
|
|
horati
|
 |
« Reply #11 on: 15. December 2009, 09:46:47 PM » |
|
there should be only a few (1 or 2 I guess), because all the Animatables, Updatables, ScheduledOperations, etc. are added to the RenderLoop/Updater/OperationScheduler directly. I guess, you're referring to the stack size and big signature problems. This could be circumvented by only passing some kind of a GameTime class instance, which is fed with precomputed nanos, micors, millis and seconds and that provides getters for these values.
Yep. The majority of the cost of non-inlined code is pushing parameters onto the stack. Storing the ultimate source of timing truth as an attribute in all those classes who might care allows you to pass it once into the constructor and reach it from the "this" pointer after that. In that way, fewer things need to be pushed onto the stack. This reduces stack size and simple computation time to make the call. In general purpose code, I never consider such things; however, I know how much difference you have told us inlining makes. In a like vein, reducing the parameter count makes non-inlined calls cheaper.
|
|
|
|
|
Logged
|
|
|
|
|