John Carmack: You need a 3.68 Teraflop GPU to match PS4's GPU

This topic is locked from further discussion.


#1001 Lucianu
Member since 2007 • 10347 Posts

why do people feed the chart spammer

wis3boi

Why do you insult the most enlightened being from the Master Race?


#1002 ronvalencia
Member since 2008 • 29612 Posts

[QUOTE="ronvalencia"]

You missed "TBB (For Multicore CPU path only) (4.1 Update 1 or Above)." TBB = Intel's Threading Building Blocks.

The demo didn't use AMD Trinity's IGP.

AMD groups "Unified Address Space" groups "Paging over PCI-Express (discrete)" and "shared memory controller (APU)" together.

tormentos

I missed nothing. AMD says so: HSA's full advantages require hUMA. Good luck getting that on PC.

"Yes those are full HSA as soon as you put them on an APU like the 78XX already demonstrated, and the 77XX as well, as well as other lower range GPUs on AMD APUs."

You missed "TBB (For Multicore CPU path only) (4.1 Update 1 or Above)." TBB = Intel's Threading Building Blocks.

---------------

AMD's "Unified Address Space" groups "Paging over PCI-Express (discrete)" and "shared memory controller (APU)" together.

[image: AMD "Unified Address Space" slide]

Good luck with 1080p resolution games.


#1003 AdobeArtist  Moderator
Member since 2006 • 25184 Posts

[image: nerds vs. athletes]

AMD655

This... THIS is the only correct answer :D :D


#1004 ronvalencia
Member since 2008 • 29612 Posts

[QUOTE="ronvalencia"]Both PC and X1 has two memory pools i.e. fast/smaller memory pool and slower/larger memory pool. For the X1, tiled resource is a requirement.

PC has the option to delay the use of tiled resource until it reaches GDDR5 size limits e.g. it could range from 1GB to 6GB.

--------------

I disagree with "the PCI-E bus overhead being too high for HSA compute usage".

1. ATC has a look-ahead feature for isochronous devices to avoid untimely table walk latencies.

2. Reasonably low PCI-E hardware latency for PCI-E version 3.0.

3. Full duplex bus with high enough bandwidth i.e. greater than 20 GB/s.

[image: AMD "Unified Address Space" slide]

Notice HSA's "Unified Address Space" groups "Paging over PCI-Express (discrete)" and "shared memory controller (APU)" together.

-------------

The HSA software stack supports X86-64 CPUs i.e. the same as AMD's current OpenCL CPU drivers, which work on Intel CPUs.

btk2k2

Well you seem to think the X1 ESRAM is an L4 cache. So if it is an L4 cache that would mean it is different to a PC memory architecture. If it is not an L4 cache then the small 'fast' pool is not small, it is tiny in comparison to the small fast pool on a modern GPU.

The PCIe bus has too many overheads for the most efficient HSA usage. It can be used in specific scenarios but an APU with hUMA will be more efficient. The hardware latency can be 0 but if the software adds necessary latency then the end result is the same. The extra steps cause a problem for optimal HSA use.

The software stack is just one part of the argument. Yes you can use the software stack to implement HSA on Intel hardware but without the hardware support of hUMA you are running in sub-optimal conditions. This is a fact and my only argument is that HSA on a supported APU with hUMA is better than software-only HSA (which is what dGPU HSA is).

Current AMD GCN's instruction set fully supports HSAIL's instruction set.

Both APU's shared memory controllers and dGPU's paging via PCIe (e.g. ATC, ATS, PRI, BAR) are hardware constructs.

The mapping of GPU's address space to CPU's address space is done via PCI-e BAR hardware.

BAR == Base Address Register, the PCI mechanism for mapping device memory into the system address space.

PCI-e version 3.0 can resize BAR.

-------


Again, AMD's "Unified Address Space" groups "Paging over PCI-Express (discrete)" and "shared memory controller (APU)" together in fixing the "Copy Overhead" issue.

I/O latency efficiency is one of many factors that influences compute performance. The other factor is the actual compute processing latency i.e. obtaining results from the stream processor units. The 7970's greater CU count yields a faster result turn-around from the stream processors.

PS: The statement "ATC has a look-ahead feature for isochronous devices to avoid untimely table walk latencies." was from AMD's Mark Hummel.

In operation, ATC (Address Translation Cache) is similar to a CPU's TLB (Translation Lookaside Buffer). ATC is also known as IOTLB and is used to reduce overheads.
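
To make the ATC/TLB analogy concrete, here is a toy sketch (not AMD's actual hardware, just an illustration of the caching idea): translations are looked up in a small cache first, and only a miss pays the cost of a simulated page-table walk.

#include <cstdint>
#include <cstdio>
#include <unordered_map>

struct ToyATC {
    std::unordered_map<std::uint64_t, std::uint64_t> cache;  // virtual page -> physical page
    std::uint64_t walks = 0;                                 // count of costly "table walks"

    std::uint64_t translate(std::uint64_t vaddr) {
        const std::uint64_t vpage = vaddr >> 12;             // 4 KiB pages
        auto it = cache.find(vpage);
        if (it == cache.end()) {                             // miss: pay for a table walk
            ++walks;
            const std::uint64_t ppage = vpage ^ 0x5A5A5AULL; // placeholder mapping, not a real page table
            it = cache.emplace(vpage, ppage).first;
        }
        return (it->second << 12) | (vaddr & 0xFFF);
    }
};

int main() {
    ToyATC atc;
    for (int i = 0; i < 1000; ++i)
        atc.translate(0x1000ULL * (i % 8));                  // 8 hot pages: mostly cache hits
    std::printf("table walks: %llu out of 1000 translations\n",
                static_cast<unsigned long long>(atc.walks)); // prints 8
    return 0;
}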


#1005 ronvalencia
Member since 2008 • 29612 Posts

[QUOTE="ronvalencia"] You are forgetting PS4's CPU IO is limited <20 GB/stormentos

Which is more than fine for that Jaguar...

Oh, and are we forgetting the 10GB connection between GPU and CPU?

Because the 20GB one is from the memory to the CPU.

Once again you are purposely ignoring one of the connections of the CPU, so it is 20GB from CPU to memory, 10GB from CPU to GPU and 176GB from memory to GPU, as well as 10GB from GPU to CPU.

hUMA is about hitting RAM i.e. a puzzle located in RAM.


#1006 ronvalencia
Member since 2008 • 29612 Posts

[QUOTE="ronvalencia"]

[QUOTE="tormentos"]

Yes, it is getting it. Does that mean it will work in all computers out there?

No, it will work on HSA systems, period. If the system is not HSA it will not work; hell, the great majority of GPUs on PC aren't even HSA compatible like the 7000 series is.

btk2k2

The AMD HSA software stack has two parts:

1. An HSA JIT re-compiler that targets the CPU. Both AMD and Intel have a common ISA for this target.

2. An HSA JIT re-compiler that targets the GPU. At the moment, AMD only supports its own GPUs. Other 3rd parties can add their own HSA re-compiler.

------------------

For AMD GPU and HSA requirements, refer to https://github.com/HSA-Libraries/Bolt (a minimal Bolt usage sketch follows the hardware list below).

TBB (For Multicore CPU path only) (4.1 Update 1 or Above). TBB = Intel's Threading Building Blocks.

AMD APU Family with AMD Radeon HD Graphics

  • A-Series
  • C-Series
  • E-Series
  • E2-Series
  • G-Series
  • R-Series

AMD Radeon HD Graphics

  • 7900 Series (7990, 7970, 7950)
  • 7800 Series (7870, 7850)
  • 7700 Series (7770, 7750)

AMD Radeon HD Graphics

  • 6900 Series (6990, 6970, 6950)
  • 6800 Series (6870, 6850)
  • 6700 Series (6790 , 6770, 6750)
  • 6600 Series (6670)
  • 6500 Series (6570)
  • 6400 Series (6450)
  • 6xxxM Series
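
As a rough illustration of how that software stack gets used, here is a minimal Bolt sketch; it assumes the Bolt SDK headers and an OpenCL runtime are installed, and the fallback behaviour noted in the comments is the reason TBB appears in the requirements above.

#include <bolt/cl/sort.h>
#include <algorithm>
#include <cstdlib>
#include <vector>

int main()
{
    std::vector<int> data(1 << 20);
    std::generate(data.begin(), data.end(), std::rand);

    // The same STL-style call can run on a supported AMD GPU via the OpenCL
    // path, or fall back to the multicore CPU path (which is where the TBB
    // requirement listed above comes from).
    bolt::cl::sort(data.begin(), data.end());
    return 0;
}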

None of that has support for hUMA though. On the PC side Kaveri will be the first hUMA enabled APU. hUMA is a big part of what makes HSA really interesting because it does away with existing overheads which allows a greater level of granularity in what is sent to the CPU and what is sent to the GPU. It also allows for GPU efficient code to have more CPU dependence since not only is the data stored in the same address pool but that address pool is on the same physical bus. With the PS4 (X1 might be the same but details have not been released) the GPU can talk directly to the CPU cache which means you do not even need to hit the memory interface to get data from the GPU to the CPU and vice versa. I am not sure if Kaveri will have that functionality but it would make sense.

AMD Kaveri APU also comes with host PCIe version 3.0 support.

It's clear you don't understand the full PCIe version 3.0 protocols.

Intel's POV on PCIe version 3.0

For an example: IO/accelerator writes data to a host memory and later host CPU reads the data out of the memory. This sequence that includes two accesses to memory/DRAM can be optimized by IO writing data directly to a cache and CPU reading out of cache. This optimization would improve performance (reduced access latencies) and significantly reduce consumed bandwidth and power of system resources (e.g. of DRAM as well as of cache-coherent system interconnect such as QPI)

...

PCIe Atomic RMW is defined as a symmetric capability i.e. it supports operations: IO<->HostMem, HostCPU<->IO, IO<->IO

PCIe 3.0 includes a Data Reuse/Caching Hints (DRH) mechanism that deals with direct CPU cache access. Refer to AMD's "Don't Move The Data" slide with current PC GCNs.

All AMD PC GCNs support native PCI-E version 3.0 protocols.

PS: IO = PCIe 3.0 accelerator.

The AMD 990FX chipset doesn't natively support PCI-E version 3.0 i.e. it needs a half-baked bridge chip.

The AMD Trinity APU doesn't natively support PCI-E version 3.0.

--------------


All PC owners without Intel Ivy Bridge, Intel Haswell or an AMD Kaveri APU are behind the next-gen consoles.


#1007 ronvalencia
Member since 2008 • 29612 Posts

None of that has support for hUMA though. On the PC side Kaveri will be the first hUMA enabled APU. hUMA is a big part of what makes HSA really interesting because it does away with existing overheads which allows a greater level of granularity in what is sent to the CPU and what is sent to the GPU. It also allows for GPU efficient code to have more CPU dependence since not only is the data stored in the same address pool but that address pool is on the same physical bus. With the PS4 (X1 might be the same but details have not been released) the GPU can talk directly to the CPU cache which means you do not even need to hit the memory interface to get data from the GPU to the CPU and vice versa. I am not sure if Kaveri will have that functionality but it would make sense.

btk2k2

On APU's shared memory controllers, the memory modules are separated by memory channels.

======== Crossbar ============

|MCH 1| |MCH 2| ..... |MCH n|


....

With dGPU GCN.

======== Crossbar ========= |PCI 3.0 host|

|MCH 1|, |MCH 2| ..... |MCH n| , |paging via PCIe|

PCIe Atomic RMW is defined as a symmetric capability i.e. it supports operations: IO<->HostMem, HostCPU<->IO, IO<->IO

HostCPU<->IO is similar to HostCPU<->hostMCH

IO<->HostMem is similar to hostMCH<->hostMCH


#1008 btk2k2
Member since 2003 • 440 Posts
Let's deal with this bit by bit.
Current AMD GCN's instruction set fully supports HSAIL's instruction set.

Both APU's shared memory controllers and dGPU's paging via PCI-E (ATC, ATS, PRI) are hardware constructs.

Again, AMD's "Unified Address Space" groups "Paging over PCI-Express (discrete)" and "shared memory controller (APU)" together in fixing the "Copy Overhead" issue.ronvalencia
I never said otherwise so why bring it up? I was not talking about any hardware to enable the paging. I was talking about how the dGPU memory pool and the main memory pool are viewed as one block of memory in software. That is a software construct because there is no other way to achieve it since they are sitting on different buses and are physically separated. I am sure that it does help to make HSA more useful on a dGPU but it still requires data to cross over PCIe, which will incur a performance penalty.
I/O latency efficiency is one of many factors that influences compute performance. The other factor is the actual compute processing latency i.e. obtaining results from the stream processor units. The 7970's greater CU count yields a faster result turn-around from the stream processors.
ronvalencia
I never said overall performance of dGPU is slower than an APU. I just said the APU is more efficient thanks to hUMA, if you had an APU with the same compute performance as a dGPU the APU would be faster at HSA than the dGPU. It also means that if your code requires a lot of back and forth between the CPU and the GPU the APU is likely to be faster than even a 7970 thanks to the lower overhead.
PS: The statement "ATC has a look-ahead feature for isochronous devices to avoid untimely table walk latencies." was from AMD's Mark Hummel.

In operation, ATC (Address Translation Cache) is similar to a CPU's TLB (Translation Lookaside Buffer). ATC is also known as IOTLB and is used to reduce overheads.
ronvalencia
That is nice. Reducing overheads is great but reducing them is not the same as eliminating them, and if you have two ways of doing something and one has more overheads than the other, regardless of overall performance, the one that has fewer overheads is more efficient. Given comparable compute characteristics the more efficient one will also be faster, but the higher overhead way might enable a more brute force approach.

Nothing you have posted shows that the statement "An APU with hUMA will be more efficient at HSA than a dGPU will be at HSA" is false. It does mean that there might be certain use cases that only work on an APU with hUMA because the overhead of using HSA on a dGPU is too high for it to work. It does not mean an APU will always be faster than a dGPU because a dGPU can brute force the performance if the use case is suitable.

I am sure that you will bring up something else but no matter what you say, the simple fact that the data has to travel a longer distance means there is more latency because the physical electrons in the wires have a finite speed. Then there is the software induced latency, which can be made more efficient, but where do you think the largest gains have been made so far?

a) CPU (APU) --> main memory
b) dGPU --> main memory
c) CPU --> dGPU memory

#1009 btk2k2
Member since 2003 • 440 Posts

[QUOTE="btk2k2"]

None of that has support for hUMA though. On the PC side Kaveri will be the first hUMA enabled APU. hUMA is a big part of what makes HSA really interesting because it does away with existing overheads which allows a greater level of granularity in what is sent to the CPU and what is sent to the GPU. It also allows for GPU efficient code to have more CPU dependence since not only is the data stored in the same address pool but that address pool is on the same physical bus. With the PS4 (X1 might be the same but details have not been released) the GPU can talk directly to the CPU cache which means you do not even need to hit the memory interface to get data from the GPU to the CPU and vice versa. I am not sure if Kaveri will have that functionality but it would make sense.

ronvalencia

On APU's shared memory controllers, the memory modules are separated by memory channels.

======== Crossbar ============

|MCH 1| |MCH 2| ..... |MCH n|


....

With dGPU GCN.

======== Crossbar ========= |PCI 3.0 host|

|MCH 1|, |MCH 2| ..... |MCH n| , |paging via PCIe|

PCIe Atomic RMW is defined as a symmetric capability i.e. it supports operations: IO<->HostMem, HostCPU<->IO, IO<->IO

HostCPU<->IO is similar to HostCPU<->hostMCH

IO<->HostMem is similar to hostMCH<->hostMCH

PCIe 3.0 latency is the same as PCIe 2.0 latency. http://www.anandtech.com/show/5261/amd-radeon-hd-7970-review/10 snippets below. (Emphasis mine)
So why is PCIe 3.0 important then? It's not the games, it's the computing. GPUs have a great deal of internal memory bandwidth (264GB/sec; more with cache) but shuffling data between the GPU and the CPU is a high latency, heavily bottlenecked process that tops out at 8GB/sec under PCIe 2.1. And since GPUs are still specialized devices that excel at parallel code execution, a lot of workloads exist that will need to constantly move data between the GPU and the CPU to maximize parallel and serial code execution. As it stands today GPUs are really only best suited for workloads that involve sending work to the GPU and keeping it there; heterogeneous computing is a luxury there isn't bandwidth for.

With PCIe 3.0 transport bandwidth is again being doubled, from 500MB/sec per lane bidirectional to 1GB/sec per lane bidirectional, which for an x16 device means doubling the available bandwidth from 8GB/sec to 16GB/sec. This is accomplished by increasing the frequency of the underlying bus itself from 5 GT/sec to 8 GT/sec, while decreasing overhead from 20% (8b/10b encoding) to 1% through the use of a highly efficient 128b/130b encoding scheme. Meanwhile latency doesn't change (it's largely a product of physics and physical distances) but merely doubling the bandwidth can greatly improve performance for bandwidth-hungry compute applications.
Anandtech

PCIe 3 is an improvement, no doubt about it, but compared to an APU with hUMA it is still full of bottlenecks and performance issues.
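
As a back-of-the-envelope check on the encoding figures quoted above, this small sketch (illustrative only) derives the per-direction x16 throughput for PCIe 2.x and PCIe 3.0 from the transfer rates and encoding schemes; link-layer and protocol overheads are ignored, so these are upper bounds.

#include <cstdio>

int main()
{
    const double gen2_gtps = 5.0, gen2_eff = 8.0 / 10.0;     // PCIe 2.x: 8b/10b encoding
    const double gen3_gtps = 8.0, gen3_eff = 128.0 / 130.0;  // PCIe 3.0: 128b/130b encoding

    // GT/s * encoding efficiency / 8 bits = GB/s of payload per lane, per direction.
    const double gen2_lane = gen2_gtps * gen2_eff / 8.0;  // ~0.5 GB/s
    const double gen3_lane = gen3_gtps * gen3_eff / 8.0;  // ~0.985 GB/s

    std::printf("PCIe 2.x x16: %.1f GB/s per direction\n", gen2_lane * 16);  // ~8 GB/s
    std::printf("PCIe 3.0 x16: %.1f GB/s per direction\n", gen3_lane * 16);  // ~15.8 GB/s
    return 0;
}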

#1010 WilliamRLBaker
Member since 2006 • 28915 Posts

[QUOTE="ronvalencia"] You are forgetting PS4's CPU IO is limited <20 GB/stormentos

 

Which is more than fine for that Jaguar...

Oh, and are we forgetting the 10GB connection between GPU and CPU?

Because the 20GB one is from the memory to the CPU.

Once again you are purposely ignoring one of the connections of the CPU, so it is 20GB from CPU to memory, 10GB from CPU to GPU and 176GB from memory to GPU, as well as 10GB from GPU to CPU.

 

lol El tormo trying to have a conversation with the adults btk2k2 and ronvalencia. Here ya go.

#1011 ronvalencia
Member since 2008 • 29612 Posts

I never said otherwise so why bring it up? I was not talking about any hardware to enable the paging. I was talking about how the dGPU memory pool and the main memory pool are viewed as one block of memory in software. That is a software construct because there is no other way to achieve it since they are sitting on different buses and are physically separated. I am sure that it does help to make HSA more useful on a dGPU but it still requires data to cross over PCIe, which will incur a performance penalty.

btk2k2

With dGPU, the HSA software stack uses the hardware PCIe features, e.g. PCIe's BAR hardware is used to map the GPU card's memory pool into the CPU's address space. This is quite simple to understand.

Older PCIe standards have a fixed-size BAR. PCIe version 3 can resize the BAR to cover the entire GPU card's memory pool.

By default, PCIe version 3 operates in legacy mode i.e. it needs new software to enable new hardware features.
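
To illustrate what "BAR maps device memory into the CPU's address space" means in practice, here is a hedged, Linux-only sketch that maps a PCI device's BAR0 through sysfs; the device address 0000:01:00.0 is a placeholder, and real GPU drivers do this inside the kernel rather than from user space.

#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>
#include <cstdio>

int main()
{
    // Hypothetical device address; substitute a real one from lspci.
    const char* bar0 = "/sys/bus/pci/devices/0000:01:00.0/resource0";

    int fd = open(bar0, O_RDWR | O_SYNC);
    if (fd < 0) { std::perror("open"); return 1; }

    struct stat st;
    if (fstat(fd, &st) != 0) { std::perror("fstat"); close(fd); return 1; }

    // After mmap, loads/stores through 'aperture' are MMIO accesses to the
    // device memory exposed by BAR0, i.e. the BAR has placed device memory
    // into the CPU's (virtual) address space.
    void* aperture = mmap(nullptr, st.st_size, PROT_READ | PROT_WRITE,
                          MAP_SHARED, fd, 0);
    if (aperture == MAP_FAILED) { std::perror("mmap"); close(fd); return 1; }

    std::printf("BAR0 mapped: %lld bytes at %p\n",
                static_cast<long long>(st.st_size), aperture);

    munmap(aperture, st.st_size);
    close(fd);
    return 0;
}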

I never said overall performance of dGPU is slower than an APU. I just said the APU is more efficient thanks to hUMA, if you had an APU with the same compute performance as a dGPU the APU would be faster at HSA than the dGPU. It also means that if your code requires a lot of back and forth between the CPU and the GPU the APU is likely to be faster than even a 7970 thanks to the lower overhead.

btk2k2

It depends on the CPU I/O. To arrive at your conclusion, what's your efficiency rating for PCIe 3.0? PS: I have another set of graphs for this case.

That is nice. Reducing overheads is great but reducing them is not the same as eliminating them, and if you have two ways of doing something and one has more overheads than the other, regardless of overall performance, the one that has fewer overheads is more efficient. Given comparable compute characteristics the more efficient one will also be faster, but the higher overhead way might enable a more brute force approach.

Nothing you have posted shows that the statement "An APU with hUMA will be more efficient at HSA than a dGPU will be at HSA" is false. It does mean that there might be certain use cases that only work on an APU with hUMA because the overhead of using HSA on a dGPU is too high for it to work. It does not mean an APU will always be faster than a dGPU because a dGPU can brute force the performance if the use case is suitable.

I am sure that you will bring up something else but no matter what you say, the simple fact that the data has to travel a longer distance means there is more latency because the physical electrons in the wires have a finite speed. Then there is the software induced latency, which can be made more efficient, but where do you think the largest gains have been made so far?

a) CPU (APU) --> main memory
b) dGPU --> main memory
c) CPU --> dGPU memory

btk2k2

To arrive at your conclusion, what's your efficiency rating for PCIe 3.0?

The AMD Trinity APU's PCIe 2.0 has a reduced hardware feature set compared to PCIe 3.0.


#1012 ronvalencia
Member since 2008 • 29612 Posts

[QUOTE="ronvalencia"]

[QUOTE="btk2k2"]

On APU's shared memory controllers, the memory modules are separated by memory channels.

======== Crossbar ============

|MCH 1| |MCH 2| ..... |MCH n|

....

With dGPU GCN.

======== Crossbar ========= |PCI 3.0 host|

|MCH 1|, |MCH 2| ..... |MCH n| , |paging via PCIe|

PCIe Atomic RMW is defined as a symmetric capability i.e. it supports operations: IO<->HostMem, HostCPU<->IO, IO<->IO

HostCPU<->IO is similar to HostCPU<->hostMCH

IO<->HostMem is similar to hostMCH<->hostMCH

btk2k2

PCIe 3.0 latency is the same as PCIe 2.0 latency. http://www.anandtech.com/show/5261/amd-radeon-hd-7970-review/10 snippets below. (Emphasis mine)
So why is PCIe 3.0 important then? It's not the games, it's the computing. GPUs have a great deal of internal memory bandwidth (264GB/sec; more with cache) but shuffling data between the GPU and the CPU is a high latency, heavily bottlenecked process that tops out at 8GB/sec under PCIe 2.1. And since GPUs are still specialized devices that excel at parallel code execution, a lot of workloads exist that will need to constantly move data between the GPU and the CPU to maximize parallel and serial code execution. As it stands today GPUs are really only best suited for workloads that involve sending work to the GPU and keeping it there; heterogeneous computing is a luxury there isn't bandwidth for.

With PCIe 3.0 transport bandwidth is again being doubled, from 500MB/sec per lane bidirectional to 1GB/sec per lane bidirectional, which for an x16 device means doubling the available bandwidth from 8GB/sec to 16GB/sec. This is accomplished by increasing the frequency of the underlying bus itself from 5 GT/sec to 8 GT/sec, while decreasing overhead from 20% (8b/10b encoding) to 1% through the use of a highly efficient 128b/130b encoding scheme. Meanwhile latency doesn't change (it's largely a product of physics and physical distances) but merely doubling the bandwidth can greatly improve performance for bandwidth-hungry compute applications.
Anandtech

PCIe 3 is an improvement, no doubt about it, but compared to an APU with hUMA it is still full of bottlenecks and performance issues.

The benchmark is not running the AMD HSA software stack.


#1013 btk2k2
Member since 2003 • 440 Posts
The benchamrk is not running AMD HSA software stack.ronvalencia
The benchmark is not important although it does show a 9% performance improvement compared to PCIe 2 so PCIe 3 does obviously have some advantages. The important part was the text where he talks about latency and what AMD are hoping to achieve with HSA on dGPU. Another snippet. (Emphasis mine)
The long term solution of course is to bring the CPU and the GPU together, which is what Fusion does. CPU/GPU bandwidth just in Llano is over 20GB/sec, and latency is greatly reduced due to the CPU and GPU being on the same die. But this doesn't preclude the fact that AMD also wants to bring some of these same benefits to discrete GPUs, which is where PCIe 3.0 comes in.
Anandtech
Now we both know that Llano does not have a hUMA so even though it shows improvements it still pretty much works the same way a dGPU does. AMD even state that they can only bring SOME of the APU advantages over to dGPUs which just further shows that an APU with hUMA will be more efficient and have more use cases for HSA.

#1014 btk2k2
Member since 2003 • 440 Posts
With dGPU, the HSA software stack uses the hardware PCIe features, e.g. PCIe's BAR hardware is used to map the GPU card's memory pool into the CPU's address space. This is quite simple to understand.

Older PCIe standards have a fixed-size BAR. PCIe version 3 can resize the BAR to cover the entire GPU card's memory pool.

By default, PCIe version 3 operates in legacy mode i.e. it needs new software to enable new hardware features.
ronvalencia

I have never said otherwise and I understand it just fine. HSA over a PCIe bus is going to have lower efficiency than hUMA because of physics. It is impossible for it to be the same or better because there are physical limitations on what the electrons themselves can do. On top of that you have more hardware to communicate with and you have the software itself. There is more latency with PCIe no matter how much they improve the software or improve the hardware because it is further away from the CPU (APU). If your dGPU is writing into the CPU cache the data has to travel from the dGPU cache, through the PCIe bus, into the CPU crossbar and into the cache. That physical distance has a latency penalty that will not exist on a hUMA APU because when the GPU in the APU writes to the CPU cache it just goes via the crossbar and into the CPU cache. Another quote from that article. (emphasis mine)
The long term solution of course is to bring the CPU and the GPU together, which is what Fusion does. CPU/GPU bandwidth just in Llano is over 20GB/sec, and latency is greatly reduced due to the CPU and GPU being on the same die. But this doesn't preclude the fact that AMD also wants to bring some of these same benefits to discrete GPUs, which is where PCIe 3.0 comes in.
Anandtech
Even without other overheads there is a physical limit to the PCIe latency that is much greater than the latency of an APU. hUMA will provide further improvements when it comes to the CPU and GPU accessing data in the main memory pool. Currently this is cumbersome even with the HSA software stack but hUMA greatly improves it from the hardware side.


#1015 ronvalencia
Member since 2008 • 29612 Posts

[QUOTE="ronvalencia"]

[QUOTE="btk2k2"]

On APU's shared memory controllers, the memory modules are separated by memory channels.

======== Crossbar ============

|MCH 1| |MCH 2| ..... |MCH n|

....

With dGPU GCN.

======== Crossbar ========= |PCI 3.0 host|

|MCH 1|, |MCH 2| ..... |MCH n| , |paging via PCIe|

PCIe Atomic RMW is defined as a symmetric capability i.e. it supports operations: IO<->HostMem, HostCPU<->IO, IO<->IO

HostCPU<->IO is similar to HostCPU<->hostMCH

IO<->HostMem is similar to hostMCH<->hostMCH

btk2k2

PCIe 3.0 latency is the same as PCIe 2.0 latency. http://www.anandtech.com/show/5261/amd-radeon-hd-7970-review/10 snippets below. (Emphasis mine)
So why is PCIe 3.0 important then? It's not the games, it's the computing. GPUs have a great deal of internal memory bandwidth (264GB/sec; more with cache) but shuffling data between the GPU and the CPU is a high latency, heavily bottlenecked process that tops out at 8GB/sec under PCIe 2.1. And since GPUs are still specialized devices that excel at parallel code execution, a lot of workloads exist that will need to constantly move data between the GPU and the CPU to maximize parallel and serial code execution. As it stands today GPUs are really only best suited for workloads that involve sending work to the GPU and keeping it there; heterogeneous computing is a luxury there isn't bandwidth for.

With PCIe 3.0 transport bandwidth is again being doubled, from 500MB/sec per lane bidirectional to 1GB/sec per lane bidirectional, which for an x16 device means doubling the available bandwidth from 8GB/sec to 16GB/sec. This is accomplished by increasing the frequency of the underlying bus itself from 5 GT/sec to 8 GT/sec, while decreasing overhead from 20% (8b/10b encoding) to 1% through the use of a highly efficient 128b/130b encoding scheme. Meanwhile latency doesn't change (it's largely a product of physics and physical distances) but merely doubling the bandwidth can greatly improve performance for bandwidth-hungry compute applications.
Anandtech

PCIe 3 is an improvement, no doubt about it, but compared to an APU with hUMA it is still full of bottlenecks and performance issues.

Anandtech's latency includes 8K x 8K Image Encrypt (AESEncryptDecrypt) processing latency i.e. compute resources being consumed. You haven't isolated pure PCIe latency.

DirectCompute's 60 ms latency.


Hardware latency and bandwidth


From http://www.realworldtech.com/fusion-llano/2/

Notebook AMD Llano has 10.4GB/s read + 10.4GB/s write onion links while PS4 has 10 GB/s read + 10 GB/s write onion/onion+ links.


#1016 tormentos
Member since 2003 • 33784 Posts

 

hUMA is about hitting RAM i.e. a puzzle located in RAM.

ronvalencia

 

hUMA is about having a fu**ing single pool of RAM, not 2, period. AMD says so. Good bye..


#1017 tormentos
Member since 2003 • 33784 Posts

[QUOTE="ronvalencia"]With dGPU, HSA software stack uses the hardware PCIe features e.g. PCIe's BAR hardware is use to map GPU card's memory pool to CPU's address space. This is quite simple to understand.

Older PCIe standard has a fix size BAR. PCIe version 3 can resize the BAR for the entire GPU card's memory pool.

By default, PCIe version 3 operates in legacy mode i.e. it need new software to enable new hardware features.btk2k2

I have never said otherwise and I understand it just fine. HSA over a PCIe bus is going to have lower efficiency than hUMA because of physics. It is impossible for it to be the same or better because there are physical limitations on what the electrons themselves can do. On top of that you have more hardware to communicate with and you have the software itself. There is more latency with PCIe no matter how much they improve the software or improve the hardware because it is further away from the CPU (APU). If your dGPU is writing into the CPU cache the data has to travel from the dGPU cache, through the PCIe bus, into the CPU crossbar and into the cache. That physical distance has a latency penalty that will not exist on a hUMA APU because when the GPU in the APU writes to the CPU cache it just goes via the crossbar and into the CPU cache. Another quote from that article. (emphasis mine)
The long term solution of course is to bring the CPU and the GPU together, which is what Fusion does. CPU/GPU bandwidth just in Llano is over 20GB/sec, and latency is greatly reduced due to the CPU and GPU being on the same die. But this doesn't preclude the fact that AMD also wants to bring some of these same benefits to discrete GPUs, which is where PCIe 3.0 comes in.
Anandtech
Even without other overheads there is a physical limit to the PCIe latency that is much greater than the latency of an APU. hUMA will provide further improvements when it comes to the CPU and GPU accessing data in the main memory pool. Currently this is cumbersome even with the HSA software stack but hUMA greatly improves it from the hardware side.

 

He will not admit it, he is too hard-headed..

True HSA requires hUMA. If it didn't, HSA would have been on PC for a long time already, and Ron ignores that; he believes that because AMD says something is compatible it means it is 100% the same.

HSA is not on PC because hUMA isn't there, and it will not work as intended on current setups because 2 different pools of memory with different speeds are present.

Ron doesn't want to lose the argument but he lost the moment I posted AMD's comments on HSA.


#1018 ronvalencia
Member since 2008 • 29612 Posts

[QUOTE="ronvalencia"]The benchamrk is not running AMD HSA software stack.btk2k2
The benchmark is not important although it does show a 9% performance improvement compared to PCIe 2 so PCIe 3 does obviously have some advantages. The important part was the text where he talks about latency and what AMD are hoping to achieve with HSA on dGPU. Another snippet. (Emphasis mine)
The long term solution of course is to bring the CPU and the GPU together, which is what Fusion does. CPU/GPU bandwidth just in Llano is over 20GB/sec, and latency is greatly reduced due to the CPU and GPU being on the same die. But this doesn't preclude the fact that AMD also wants to bring some of these same benefits to discrete GPUs, which is where PCIe 3.0 comes in.
Anandtech
Now we both know that Llano does not have a hUMA so even though it shows improvements it still pretty much works the same way a dGPU does. AMD even state that they can only bring SOME of the APU advantages over to dGPUs which just further shows that an APU with hUMA will be more efficient and have more use cases for HSA.

You haven't isolated pure PCIe latency i.e. Anandtech's latency includes 8K x 8K Image Encrypt (AESEncryptDecrypt) processing latency.

PCIe version 3.0 has 20 percent overhead.

[image: LSI 9207-8i PCIe 3.0 benchmark]

On PCIe version 3.0 16 lanes with 20 percent overhead, that would yield about ~25.6 GB/s (full duplex).
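
For what it's worth, the arithmetic behind that ~25.6 GB/s figure looks like this (a sketch only; the 20 percent protocol-overhead number is taken from the post above, not from the PCIe spec).

#include <cstdio>

int main()
{
    const int    lanes             = 16;
    const double per_lane_GBps     = 1.0;   // ~1 GB/s per lane per direction (8 GT/s, 128b/130b)
    const double protocol_overhead = 0.20;  // the 20 percent figure assumed in the post above

    const double per_direction = lanes * per_lane_GBps * (1.0 - protocol_overhead);  // 12.8 GB/s
    const double full_duplex   = per_direction * 2.0;                                // 25.6 GB/s

    std::printf("per direction: %.1f GB/s, full duplex: %.1f GB/s\n",
                per_direction, full_duplex);
    return 0;
}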


#1019 tormentos
Member since 2003 • 33784 Posts

 

You haven't isolated pure PCIe latency i.e. Anandtech's latency includes 8K x 8K Image Encrypt (AESEncryptDecrypt) processing latency.

PCIe version 3.0 has 20 percent overhead.

On PCIe version 3.0 16 lanes with 20 percent overhead, that would yield about ~25.6 GB/s (full duplex).

ronvalencia

 

 

Give it up. HSA over a PCIe bus with 2 very different memory pools will not work as it does with hUMA, period. HSA will help current setups but it will not perform like HSA on an APU, which has the CPU and GPU on the same die with a unified memory address space that both the CPU and GPU can access at the same time. The APU is more efficient, period.


#1020 ronvalencia
Member since 2008 • 29612 Posts

[QUOTE="ronvalencia"]

You haven't isolated pure PCIe latency i.e. Anandtech's latency includes 8K x 8K Image Encrypt (AESEncryptDecrypt) processing latency.

PCIe version 3.0 has 20 percent overhead.

On PCIe version 3.0 16 lanes with 20 percent overhead, that would yield about ~25.6 GB/s (full duplex).

tormentos

Give it up. HSA over a PCIe bus with 2 very different memory pools will not work as it does with hUMA, period. HSA will help current setups but it will not perform like HSA on an APU, which has the CPU and GPU on the same die with a unified memory address space that both the CPU and GPU can access at the same time. The APU is more efficient, period.

Sorry, AMD bundled "Paging over PCI-Express (discrete)" and "shared memory controller" (APU) as solutions against the "Copy Overhead" issue, period.

[image: AMD "Unified Address Space" slide]

From http://www.realworldtech.com/fusion-llano/2/

In contrast, system memory is optimized for latency and locality; contiguous requests will tend to stay to one memory channel and keep DRAM pages open

Even with APUs, data has a locality bias within the memory channel, period.


#1021 ronvalencia
Member since 2008 • 29612 Posts

[QUOTE="ronvalencia"]With dGPU, HSA software stack uses the hardware PCIe features e.g. PCIe's BAR hardware is use to map GPU card's memory pool to CPU's address space. This is quite simple to understand.

Older PCIe standard has a fix size BAR. PCIe version 3 can resize the BAR for the entire GPU card's memory pool.

By default, PCIe version 3 operates in legacy mode i.e. it need new software to enable new hardware features.btk2k2

I have never said otherwise and I understand it just fine. HSA over a PCIe bus is going to have lower efficiency than hUMA because of physics. It is impossible for it to be the same or better because there are physical limitations on what the electrons themselves can do. On top of that you have more hardware to communicate with and you have the software itself. There is more latency with PCIe no matter how much they improve the software or improve the hardware because it is further away from the CPU (APU). If your dGPU is writing into the CPU cache the data has to travel from the dGPU cache, through the PCIe bus, into the CPU crossbar and into the cache. That physical distance has a latency penalty that will not exist on a hUMA APU because when the GPU in the APU writes to the CPU cache it just goes via the crossbar and into the CPU cache. Another quote from that article. (emphasis mine)
The long term solution of course is to bring the CPU and the GPU together, which is what Fusion does. CPU/GPU bandwidth just in Llano is over 20GB/sec, and latency is greatly reduced due to the CPU and GPU being on the same die. But this doesn't preclude the fact that AMD also wants to bring some of these same benefits to discrete GPUs, which is where PCIe 3.0 comes in.
Anandtech
Even without other overheads there is a physical limit to the PCIe latency that is much greater than the latency of an APU. hUMA will provide further improvements when it comes to the CPU and GPU accessing data in the main memory pool. Currently this is cumbersome even with the HSA software stack but hUMA greatly improves it from the hardware side.

http://www.whatmannerofburgeristhis.com/blog/gcn-opencl-memory-fences-update-and-inline-ptx/

Use atomics to bypass the L1 cache if you need strong memory consistency across workgroups. This is an option for reads that aren't very critical. This was true for one of the N-body kernels. For another it was many times slower than running a single workgroup at a time to ensure global consistency.


#1023 tormentos
Member since 2003 • 33784 Posts

 

Sorry, AMD bundled "Paging over PCI-Express (discrete)" and "shared memory controller" (APU) as solutions against the "Copy Overhead" issue, period.

From http://www.realworldtech.com/fusion-llano/2/

In contrast, system memory is optimized for latency and locality; contiguous requests will tend to stay to one memory channel and keep DRAM pages open

Even with APUs, data has a locality bias within the memory channel, period.

ronvalencia

First off, can you give us a run-down of what makes HSA so important?

In a PC, the CPU is good for serial workloads and the GPU is good for parallel workloads. You need a good balance and have to put the right work load on the right piece of hardware. When we talk about HSA, you see that the memories are shared, so the argument won't be whether you're using the CPU or the GPU part of a chip, but what's the end experience. Applications written for HSA will use both, speeding them up compared to traditional CPU-driven workloads. The difference can be huge; what we're saying is that the CPU isn't the most important part of a system any more. If you want massive performance gains, you need to enable the GPU and use both.

 

[image]

http://www.expertreviews.co.uk/processors/1299913/the-big-interview-apus-hsa-and-where-next-for-amd

 

 

Give it a rest.

[image]

AMD aims to address this with its heterogeneous Uniform Memory Access (hUMA) technology, which builds upon HSA, the intelligent computing architecture utilized in the company's APUs that enables CPU, GPU and other processors to work in harmony on a single piece of silicon by seamlessly moving the right tasks to the best suited processing element.


http://www.tomshardware.com/news/AMD-HSA-hUMA-APU,22324.html


HSA is a design which needs hUMA. Current setups can take advantage of HSA, but will not work like true HSA, and none of those articles describes current setups as working like an APU with hUMA.


#1024 wis3boi
Member since 2005 • 32507 Posts

this has to be the worst conversation SW has ever spawned. Two people bickering back and forth over meaningless garbage and neither one knows what the hell they are talking about


#1025 mitu123
Member since 2006 • 155290 Posts

this has to be the worst conversation SW has ever spawned. Two people bickering back and forth over meaningless garbage and neither one knows what the hell they are talking about

wis3boi
Now that's quite an accomplishment in itself.

#1026 ronvalencia
Member since 2008 • 29612 Posts

[image]

tormentos

:lol: You're posting something without knowing what you are talking about.


PCI-E has hardware cache/memory coherence support.


"Today's PCIe requests are coherent with respect to system memory/caches" - PCI-SIG

----------

http://www.eetimes.com/document.asp?doc_id=1163775

Intel/IBM Geneseo (extensions to PCI-E) was designed to counter AMD's Torrenza program.

There's a relationship between latency and bandwidth. For example

Latency plays an important role in PCI Express* performance and should be considered in relationship to bandwidth. For example, if the round trip latency for a read of 128 bytes is 400ns, then the read bandwidth would be 2.5MB/second (assuming one outstanding transaction). If the latency is increased to 800ns, then the read bandwidth decreases to 1.25MB/second, a one to one relationship.

http://www.intel.com.au/content/dam/www/public/us/en/documents/white-papers/pcie-hw-level-io-benchmarking-paper.pdf

Latency reduces an I/O's bandwidth potential.
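
As a rough sketch of that one-to-one relationship (assuming a single outstanding transaction and a fixed 128-byte read): effective bandwidth is just payload size divided by round-trip latency, so doubling the latency halves the bandwidth. Note that plugging 128 bytes at 400 ns into this formula gives about 320 MB/s and 2.5 million reads per second, so read the absolute MB/s figures in the quote with care; the one-to-one relationship is the point that holds either way.

#include <cstdio>

// Effective read bandwidth with a single outstanding transaction:
// bandwidth = payload_bytes / round_trip_latency.
double read_bandwidth_MBps(double payload_bytes, double latency_ns)
{
    return payload_bytes / latency_ns * 1000.0;  // bytes per ns == GB/s, so *1000 for MB/s
}

int main()
{
    std::printf("128 B @ 400 ns: %.0f MB/s (%.1f million reads/s)\n",
                read_bandwidth_MBps(128, 400), 1e9 / 400 / 1e6);  // 320 MB/s, 2.5 Mreads/s
    std::printf("128 B @ 800 ns: %.0f MB/s (%.1f million reads/s)\n",
                read_bandwidth_MBps(128, 800), 1e9 / 800 / 1e6);  // 160 MB/s, 1.25 Mreads/s
    return 0;
}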


#1027 ronvalencia
Member since 2008 • 29612 Posts

this has to be the worst conversation SW has ever spawned. Two people bickering back and forth over meaningless garbage and neither one knows what the hell they are talking about

wis3boi
Can you do better?