
Title: Motorola 68040 info
Post by: netfreak on December 09, 2012, 12:47:45 am
68040 Info:

----------------------------
This new CISC microprocessor
offers RISC performance
----------------------------
 
Motorola has officially unwrapped its newest 32-bit
microprocessor, the 68040. Manufactured with 0.8-micron
high-speed CMOS technology, the 68040 packs 1.2 million
transistors on a single silicon die. With 900,000 extra
transistors to work with over the 300,000 transistors in a 68030
processor, the 68040's designers added new features and boosted
performance. New features include the following:


 
-- Optimised 68030 integer unit (IU). While retaining object-code
compatibility with previous 68000-family processors, the IU has
been optimised to execute instructions in fewer clock cycles
(i.e., run faster). The claimed boost in performance is three
times that of a 68030.
 
-- Integral FPU. The 68020 and 68030 require external FPU
coprocessor chips to handle floating-point math. The 68040,
however, has an FPU built into it, giving it the power to do
serious number crunching. The FPU's data types are compatible
with the ANSI/IEEE 754 standard for binary floating-point math,
and its instruction set is object code-compatible with Motorola's
68881/68882 FPUs. Like the IU, the 68040's on-chip FPU has been
optimised to execute frequently used instructions using fewer
clock cycles. The claimed performance boost is 10 times that of a
68882.
 
-- Large caches. Processor accesses to the system bus are
minimised by storing the most recently used set of instructions
or data in on-chip, 4K-byte caches. Both caches operate
independently but can be accessed at the same time. Bus snoop
logic is used to maintain cache coherence (i.e., it ensures that
the cache's contents match those parts of memory corresponding to
the cache). The bus snooper's design is fine-tuned to support
multiprocessor systems where one or more bus masters or 68040s
might share the same section of memory.
 
-- Separate memory units for instructions and data. Each memory
unit consists of a memory management unit, a cache controller,
and bus snoop logic. The MMUs use a subset of the 68030's MMU
instruction set. Both memory units function independently of each
other to improve processor throughput.
 
The 68040 ships with an initial clock speed of 25MHz; higher
speeds are to be available in the future, Motorola says. The
68040 comes in a 179-pin grid-array package. With the elimination
of coprocessor function lines (now that the MMU and FPU are
consolidated onto the processor) and the addition of snoop
control lines, the 68040 is not pin-compatible with the 68030.
 
Because of the 68040's software compatibility with its
predecessors, it can tap into the existing software base of 680x0
applications. It does so while eliminating a component (the
external FPU) from a computer's design and improving performance
at the same time. In fact, the 68040 executes, on average, nearly
one instruction per clock cycle -- the same as a RISC processor.

 
Fine-Tuned for Performance
 
The 68040 was built on the firm foundation of its
predecessors. The design team used the experience garnered from
developing earlier processors to aid in optimising the throughput
of the 040.
 
The 040 was designed from the ground up, Motorola engineers
said. It incorporates a high degree of parallelism using a number
of internal buses. An internal Harvard architecture gives the
processor simultaneous access to instructions and data. Both the IU
and FPU have separate pipelines and can operate concurrently. For
example, the FPU can perform floating-point instructions
independently of the IU. Each stream (instructions or data) has
its own dedicated cache and memory unit that function independently of
each other. A smart bus controller assigns priorities to bus
traffic to and from the caches.
 
There were several key areas where Motorola was able to
boost performance. The first was in reducing the clock cycles
needed to execute certain instructions. The next was to ensure
that the processor funnels instructions and data into itself
quickly and constantly, lest it stall while waiting on
information. The third was getting results back into the system
without interfering with incoming information. Finally, as if
this weren't enough, the processor stays off the system bus to a
greater extent than other processor designs do, leaving the bus
free for DMA transfers and other bus masters.


CISC with the Speed of RISC
 
The IU was optimised so that high-usage instructions execute
in fewer clock cycles, particularly branch instructions. Motorola
said it performed thousands of code traces using real-world
applications to determine which instructions were used most
often. The IU consists of 6 stages: instruction prefetch, decode,
effective address calculation, operand fetch, execution, and
writeback (i.e., the result is written to either a register or to
memory). Each stage works concurrently on the instruction
pipeline. Dual prefetch and decode units deal with the branch
instructions: One set processes the instruction taken on the
branch, and another processes the instruction not taken. In this
way, no matter what the outcome, the IU has the next instruction
decoded and ready to go without seriously disrupting the
pipeline. This complex design has a big pay-off: Motorola has
determined that the average instruction takes 1.3 clock cycles to
execute. The ability to execute an instruction once per clock
cycle is the performance edge of RISC processors -- yet the
68040's IU accomplishes the same goal while executing
complex-instruction-set computer (CISC) instructions.
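
As a quick sanity check on that figure (this arithmetic is mine,
not the article's), 25 million clocks per second divided by 1.3
clocks per instruction works out to roughly 19 MIPS, which lines
up with the preliminary 20-MIPS number quoted further down:

/* Back-of-the-envelope throughput implied by the quoted figures.
   The 25MHz clock and the 1.3 clocks-per-instruction average come
   from the article; the calculation itself is just illustration. */
#include <stdio.h>

int main(void)
{
    double clock_hz = 25e6;   /* 25MHz 68040 */
    double cpi      = 1.3;    /* average clocks per instruction */

    double mips = clock_hz / cpi / 1e6;
    printf("approx. %.1f MIPS\n", mips);   /* about 19.2 MIPS */
    return 0;
}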
 
The FPU adds 11 registers to the 68040 register set: Eight
of them are 80-bit floating-point registers, and three are
status, control, and instruction address registers. The FPU has a
three-stage execution unit, and, like the IU, each stage operates
concurrently. Load and store instructions (FMOVE) can be
performed during other arithmetic operations, and a 64- by 8-bit
hardware multiplication unit speeds many calculations. However,
the FPU only implements a subset of the 68882 instructions
on-chip. The transcendental (trigonometric and exponential)
functions are emulated in software via a software trap. But
Motorola claims that even these instructions should execute 25%
to 100% faster on a 25MHz 68040 than on a 33MHz 68882 FPU.
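
The article doesn't describe the trap mechanism itself, but
conceptually the flow is something like the sketch below. Everything
here -- the handler name, the frame layout, the opcode value -- is
made up for illustration; the real 68040 exception interface and
Motorola's emulation software are more involved.

/* Conceptual sketch of trap-based emulation of an unimplemented
   floating-point instruction.  Names and structure are hypothetical. */
#include <math.h>

typedef struct {
    unsigned short opcode;   /* the FP instruction that trapped */
    double        *fp_reg;   /* image of the destination FP register */
} fp_trap_frame;

#define OP_FSIN 0x000E       /* hypothetical opcode value for FSIN */

void fp_unimplemented_handler(fp_trap_frame *frame)
{
    switch (frame->opcode) {
    case OP_FSIN:
        /* Emulate the transcendental operation in software, then
           write the result back to the saved register image. */
        *frame->fp_reg = sin(*frame->fp_reg);
        break;
    /* ... other emulated instructions ... */
    }
    /* On return, execution resumes after the trapped instruction. */
}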


Boosting Throughput
 
In the area of throughput, each stream is managed by a
separate memory unit that uses an MMU for logical-to-physical
address translations during bus accesses. These MMUs support
demand-paged virtual memory. Both MMUs have a four-way
set-associative address translation cache (ATC) with 64 entries
(versus 22 entries for the 68030). The ATCs reduce processor
overhead by storing the most recent address translations. When an
address translation is required, the ATC is searched, and if it
contains the needed translation, that entry is used immediately.
Otherwise, a
combination of high-speed hardware logic and microcode searches
the translation tables located in main memory.
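
In rough C-like terms, each translation goes something like this.
It is only a sketch: the helper routines stand in for on-chip
hardware and microcode, and a 4K-byte page size is assumed.

/* Simplified model of a logical-to-physical address translation.
   The ATC and table-walk helpers are stand-ins for on-chip logic. */
#include <stdint.h>
#include <stdbool.h>

#define PAGE_SHIFT 12                      /* assume 4K-byte pages */
#define PAGE_MASK  ((1u << PAGE_SHIFT) - 1)

static bool atc_lookup(uint32_t logical_page, uint32_t *physical_page)
{
    (void)logical_page; (void)physical_page;
    return false;                          /* stub: pretend every lookup misses */
}

static uint32_t walk_translation_tables(uint32_t logical_page)
{
    return logical_page;                   /* stub: identity mapping */
}

static void atc_insert(uint32_t lp, uint32_t pp) { (void)lp; (void)pp; }

uint32_t translate(uint32_t logical_addr)
{
    uint32_t page   = logical_addr >> PAGE_SHIFT;
    uint32_t offset = logical_addr & PAGE_MASK;
    uint32_t physical_page;

    if (!atc_lookup(page, &physical_page)) {          /* ATC miss */
        physical_page = walk_translation_tables(page);
        atc_insert(page, physical_page);              /* remember it for next time */
    }
    return (physical_page << PAGE_SHIFT) | offset;
}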
 
Like the FPU, these MMUs implement a subset of the 68030's
MMU instruction set. Gone are the PLOAD and PMOVE instructions,
because enhanced existing instructions made them superfluous.
Also, only 2 memory page sizes are supported, 4K and 8K bytes,
whereas the 68030 MMU supported 8 page sizes ranging from 256
bytes to 32K bytes. A design tradeoff was made here: A
performance gain was possible by supporting only the 2 most
common page sizes. In any case, this change impacts only
operating-system code, since MMU instructions aren't normally
used by applications.
 
The two on-chip 4K caches improve processor throughput in 2
ways: They keep the pipelines filled and minimise system bus
accesses. To see how this is done, you must examine the structure
of the cache. Each is a four-way set-associative cache composed
of 64 sets of four lines. A line consists of 4 longwords, or 16
bytes. Cache lines are read or written rapidly using burst-mode
access (a type of bus transfer that moves 16 bytes in a minimum
of clock cycles). For read operations, this fills the cache
efficiently and, at the same time, loads into the cache adjacent
instructions or data that could be used in the near future.
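
Given those figures (4K bytes, four ways, 64 sets, 16-byte lines),
the way an address maps onto the cache falls out directly. The
snippet below is just that arithmetic, not Motorola's implementation:

/* How a 32-bit address maps onto a 4K-byte, four-way set-associative
   cache with 16-byte lines (64 sets x 4 lines x 16 bytes = 4096 bytes). */
#include <stdio.h>
#include <stdint.h>

#define LINE_SIZE 16   /* bytes per line (4 longwords) */
#define NUM_SETS  64   /* sets per cache               */
#define WAYS       4   /* lines per set                */

int main(void)
{
    uint32_t addr = 0x0001234Cu;                      /* arbitrary example address */

    uint32_t offset = addr & (LINE_SIZE - 1);         /* low 4 bits: byte within line */
    uint32_t set    = (addr / LINE_SIZE) % NUM_SETS;  /* next 6 bits: set index       */
    uint32_t tag    = addr / (LINE_SIZE * NUM_SETS);  /* remaining bits: tag          */

    /* On an access, the tag is compared against the 4 lines in this set. */
    printf("offset=%u set=%u tag=0x%X (compared against %d ways)\n",
           offset, set, tag, WAYS);
    return 0;
}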

 
Zen and the Art of Cache Maintenance
 
As the cache is accessed and data modified, cache-mode bits
in the ATC determine, on a page-by-page basis, the method by
which the information is handled. That is, the ATC entry that
corresponds to the address in main memory whose contents were
copied into the cache decides how the data will be updated. The
modes are cacheable write-through, cacheable copyback,
noncacheable, and noncacheable I/O.
 
In the cacheable write-through mode, an update to the data
cache forces a write to main memory. While this generates
additional bus activity, this mode is required when working with
a portion of memory that other processors share. The copyback
mode updates the cache line but without updating main memory. The
modified (or "dirty") cache line is copied back into main memory
only when absolutely necessary. "Noncacheable" indicates that the
data shouldn't be cached, which is typically the situation for
shared data structures or for locked accesses (e.g., an operand
access or a translation table entry update). Noncacheable I/O
indicates that the data can't be cached and must be read or
written in the exact order of instruction execution. This mode is
for memory-mapped I/O devices (typically a serial device) where
the information's order is crucial.
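
Summed up in code form, the four page-level modes look roughly like
this. The identifiers are paraphrases of the article's terms, not
the 68040's actual ATC bit encodings:

/* The four per-page cache modes described above, with the behaviour
   of each summarised in comments.  Names and values are illustrative. */
enum cache_mode {
    CACHE_WRITETHROUGH,   /* cache updated AND written to main memory at
                             once -- extra bus traffic, but required for
                             memory shared with other processors          */
    CACHE_COPYBACK,       /* cache line updated only; the "dirty" line is
                             copied back to main memory when necessary    */
    NONCACHEABLE,         /* never cached -- e.g., shared data structures
                             or locked accesses                           */
    NONCACHEABLE_IO       /* never cached AND accessed strictly in the
                             order of instruction execution -- for
                             memory-mapped I/O such as serial devices     */
};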
 
The bus snooper is used in multiple bus master situations
where a noncaching bus master, such as a DMA controller, might
modify the memory that is mapped into the 68040's cache. The bus
snooper monitors the external bus and updates the cache as
required.
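
Conceptually, the snooper's job boils down to something like the
following sketch. All the names are placeholders for internal
hardware, and invalidation is shown as the response, though updating
the line in place is the other possibility mentioned above:

/* Very simplified model of bus snooping: when another bus master
   writes to memory, any matching line in the 68040's cache must be
   updated or invalidated so the cache never holds stale data. */
#include <stdint.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct {
    bool     valid;
    uint32_t tag;
    uint8_t  data[16];        /* one 16-byte cache line */
} cache_line;

static cache_line *cache_lookup(uint32_t physical_addr)
{
    (void)physical_addr;
    return NULL;              /* stub: stand-in for the cache tag compare */
}

/* Called for every external write the snooper observes on the bus. */
void snoop_external_write(uint32_t physical_addr)
{
    cache_line *line = cache_lookup(physical_addr);
    if (line != NULL) {
        /* The snooped location is cached: mark the line invalid so the
           next access re-reads current data from main memory. */
        line->valid = false;
    }
}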

Cache validity is handled on a line-by-line basis (i.e., a
cache miss triggers a burst-mode access that updates 16 bytes
either in the cache or main memory). The copyback mode minimises
writes to main memory, and the bus controller prioritises each
cache's external memory requests. Read requests take priority
over writes to ensure that the pipelines remain filled.
 
The caches are critical to the 040's overall throughput.
They keep instructions and data moving into the processor while
satisfying the apparently contradictory goal of minimising system
bus accesses. Motorola estimates that the cache hit rate is about
93 percent for instruction and data reads and about 94 percent
for data writes.
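
Put another way (my arithmetic, not Motorola's), those hit rates
mean only about 7 reads and 6 writes in every 100 ever need the
system bus. The snippet below is just that calculation:

/* What the quoted hit rates mean for bus traffic: the misses are the
   only accesses that go out over the system bus.  The hit rates come
   from the article; everything else is illustration. */
#include <stdio.h>

int main(void)
{
    double read_hit_rate  = 0.93;   /* instruction and data reads */
    double write_hit_rate = 0.94;   /* data writes                */

    printf("reads reaching the bus:  %.0f per 1000\n",
           (1.0 - read_hit_rate) * 1000);
    printf("writes reaching the bus: %.0f per 1000 (fewer still in copyback mode)\n",
           (1.0 - write_hit_rate) * 1000);
    return 0;
}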
 

A Processor for the 1990s
 
It is perhaps appropriate that Motorola has introduced the
68040 in the first month of the 1990s. The 040 has the power to
tackle the information-intensive jobs that we will be dealing
with regularly over the next ten or so years.
 
Preliminary results have a 68040 weighing in at 20 million
instructions per second versus the SPARC's 18 MIPS and the
80486's 15 MIPS, all clocked at 25MHz. On floating-point
operations, the 68040 antes up 3.5 million floating-point
operations per second versus the SPARC's 2.6 MFLOPS and the
80486's 1 MFLOPS. If these numbers are accurate, then the 68040
already outperforms one RISC processor.
 
But the computer industry doesn't stand still. As we move
into the new decade, we can expect new RISC processors to once
again take the lead in performance. Still, the 68040 shows that
owners of CISC systems can have their cake and eat it, too. They
don't have to forsake their software base or settle for mediocre
performance.
 
Original:
http://preterhuman.net/texts/computing/general/68040.txt