Tuesday, April 12, 2016

Computer Architecture

Structs and unions in C
- A union is a user-defined data type whose members all share the same memory location, so only one member holds a valid value at a time. A struct, by contrast, gives each member its own storage.

MESI vs MOESI
- The O (Owned) state helps to reduce memory accesses: one cache holds the most recent (modified) copy of the line and supplies it to all other caches directly, so the data does not have to be written back to memory first.


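How to swap two bits in an integer
- Use XOR: if the two bits differ, flip both of them with an XOR mask; if they are equal, nothing needs to change.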
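A minimal sketch of the XOR approach, assuming a non-negative integer and bit positions counted from the LSB. The function name `swap_bits` is only illustrative.

```python
def swap_bits(x: int, i: int, j: int) -> int:
    """Swap bit i and bit j of a non-negative integer x (bit 0 is the LSB)."""
    # The XOR of the two bits is 1 only when they differ.
    if ((x >> i) ^ (x >> j)) & 1:
        # Flipping both positions swaps a differing pair of bits.
        x ^= (1 << i) | (1 << j)
    return x


# Bits 0 and 1 of 0b1010 differ, so both are flipped.
swapped = swap_bits(0b1010, 0, 1)  # 0b1001
```

When the two bits are equal the value is returned unchanged, which is why no temporary variable is needed.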


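How to reverse bits in an X-bit number
- Lookup table of bit-reversed values, indexed by a chunk (for example a byte) of the input.
- Repeated swap-and-OR: swap the two halves, then the quarters within each half, and so on down to adjacent bits, combining the shifted pieces with OR.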
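Both approaches can be sketched for a 32-bit value as follows. The width of 32 bits, the byte-wide table and the function names are assumptions made for the example.

```python
# Table of bit-reversed bytes, built once: BYTE_REV[0b00000001] == 0b10000000.
BYTE_REV = [int(f"{b:08b}"[::-1], 2) for b in range(256)]


def reverse_bits32_lut(x: int) -> int:
    """Reverse a 32-bit value one byte at a time using the lookup table."""
    x &= 0xFFFFFFFF
    return (
        (BYTE_REV[x & 0xFF] << 24)
        | (BYTE_REV[(x >> 8) & 0xFF] << 16)
        | (BYTE_REV[(x >> 16) & 0xFF] << 8)
        | BYTE_REV[(x >> 24) & 0xFF]
    )


def reverse_bits32_swap(x: int) -> int:
    """Reverse a 32-bit value by swapping ever larger groups of bits."""
    x &= 0xFFFFFFFF
    x = ((x >> 1) & 0x55555555) | ((x & 0x55555555) << 1)   # adjacent bits
    x = ((x >> 2) & 0x33333333) | ((x & 0x33333333) << 2)   # bit pairs
    x = ((x >> 4) & 0x0F0F0F0F) | ((x & 0x0F0F0F0F) << 4)   # nibbles
    x = ((x >> 8) & 0x00FF00FF) | ((x & 0x00FF00FF) << 8)   # bytes
    x = (x >> 16) | ((x << 16) & 0xFFFFFFFF)                # half-words
    return x
```

The table version trades 256 entries of memory for fewer operations; the swap version needs no memory and takes log2(32) = 5 steps.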


Why is a branch delay slot necessary?
- A delay slot is an instruction slot that gets executed without the effects of the preceding instruction, typically a branch.
- The point of the delay slot is to execute an instruction that has already made it through part of the pipeline and sits in a slot that would otherwise have to be thrown away.

What happens if you forward data to the execute stage rather than the decode stage?
- The pipeline does not have to stall, and the next instruction can receive the data just before execution.

4-way set associative vs 16-way set associative: which consumes more power and which is slower?
- The fewer ways that have to be searched in parallel within a set, the less comparison hardware is needed.
- Allowing N cache lines per set in an N-way set-associative cache reduces conflict misses.
- So, for the same total cache size, a 16-way set-associative cache uses more hardware for tag comparison, and thus more power, than a 4-way set-associative cache. The wider comparison and way selection also tend to make its hit time slower.
- But a 16-way set-associative cache has fewer collisions, as there are more ways to pick from in each set.
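To make the hardware difference concrete, here is a small sketch that derives the cache geometry from size, line size and associativity. The 32 KiB size, 64-byte lines and 32-bit addresses are assumed example values, and all sizes are assumed to be powers of two.

```python
def cache_geometry(size_bytes: int, line_bytes: int, ways: int, addr_bits: int = 32) -> dict:
    """Return set count, address field widths and parallel tag comparators."""
    sets = size_bytes // (line_bytes * ways)
    offset_bits = line_bytes.bit_length() - 1   # log2 for powers of two
    index_bits = sets.bit_length() - 1
    tag_bits = addr_bits - index_bits - offset_bits
    return {
        "sets": sets,
        "offset_bits": offset_bits,
        "index_bits": index_bits,
        "tag_bits": tag_bits,
        "tag_comparators": ways,  # one comparator per way, all active on a lookup
    }


four_way = cache_geometry(32 * 1024, 64, 4)      # 128 sets, 4 comparators of 19 bits
sixteen_way = cache_geometry(32 * 1024, 64, 16)  # 32 sets, 16 comparators of 21 bits
```

For the same capacity, the 16-way cache compares four times as many tags per access, and each tag is slightly wider.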

architectural reg file vs physical reg file.
- The architected registers and rename registers can be pooled together to form a single physical register file.


OoO execution vs in-order execution, and the differences between them
Topology if one of the inputs is faster than the other: passgate, CMOS, domino etc.

What is the problem with delay slots?
- It is hard to find independent instructions to fill them.
- If the pipeline changes in the future, the software has to change too.

What is predicated execution?
- Predicated execution avoids branches and simplifies compiler optimizations by converting a control dependence into a data dependence.
- It replaces branch prediction by letting the CPU execute the instructions of all possible branch paths; only the results whose predicate is true are kept.
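The idea of turning a control dependence into a data dependence can be sketched in software as a branch-free select. This is only an analogy to hardware predication, and the function name is made up for the example.

```python
def predicated_select(cond: bool, a: int, b: int) -> int:
    """Return a if cond is true, else b, without branching on cond."""
    # Both candidate values are already computed by the caller, like both
    # paths of a predicated branch; the predicate only decides which is kept.
    mask = -int(bool(cond))        # -1 (all ones) when true, 0 when false
    return (a & mask) | (b & ~mask)


kept = predicated_select(True, 7, 9)   # 7
```

The result depends on `cond` only through the data value `mask`, so there is no branch to predict.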

Memory
http://www.barrgroup.com/Embedded-Systems/How-To/Memory-Types-RAM-ROM-Flash


Digital circuit Design

This is the second post of the 'Let's do it' series. It covers questions asked in digital circuit design interviews.

a reference link for questions - http://www.techinterviews.com/large-list-of-intel-interview-questions

Key difficulty in voltage scaling
- The supply voltage needs to scale down to save power.
- This is hard because Vth does not scale with it.

Silicide
- it is a metal - silicon alloy

Power
Ptot = Pdyn + Pstat
Pdyn = Ptran + Psc (sc - short circuit)

Ptot = Ptran + Psc + Pstat
Ptot = C*Vcc^2*f + Vcc*Imax*((tr+tf)/2)*f + Vcc*Ileak
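As a quick numeric check of the three terms (switching, short circuit, static), here is a small sketch. The parameter names and the example values are assumptions chosen only to show the relative sizes.

```python
def power_breakdown(c_load, vcc, freq, i_max, t_rise, t_fall, i_leak):
    """Return (switching, short-circuit, static, total) power in watts."""
    p_tran = c_load * vcc ** 2 * freq                   # C * Vcc^2 * f
    p_sc = vcc * i_max * (t_rise + t_fall) / 2 * freq   # Vcc * Imax * (tr+tf)/2 * f
    p_stat = vcc * i_leak                               # Vcc * Ileak
    return p_tran, p_sc, p_stat, p_tran + p_sc + p_stat


# Assumed example: 1 pF, 1.0 V, 1 GHz, 100 uA peak, 50 ps edges, 1 uA leakage.
p_tran, p_sc, p_stat, p_tot = power_breakdown(1e-12, 1.0, 1e9, 1e-4, 50e-12, 50e-12, 1e-6)
```

Because the switching term goes with the square of Vcc, halving the supply cuts that term to a quarter.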

Reducing dynamic power
- reduce voltage
- reduce capacitance
- reduce frequency

Reducing short circuit current
- fast rise and fall times on input signals
- - reduce input capacitance
- - insert small buffers to "clean up" slow input signals before sending them to a large gate
- A large output load keeps the output from switching quickly, so during the input transition the transistor that is turning off sees almost no voltage across it and conducts very little short circuit current.
- A high threshold voltage also helps to reduce the short circuit current, since it shortens the window in which both transistors are ON.

Reducing leakage current
- reduce transistor size
- lower the supply voltage
- increase the threshold voltage (to reduce sub-threshold leakage)
- use a high-K dielectric in place of SiO2 (to reduce leakage through the gate)
- reduce temperature by reducing power, i.e. by reducing frequency, area or capacitance

** Sub-threshold leakage - a current flows between drain and source, driven by Vds, even when the transistor is OFF. It depends on the threshold voltage and on DIBL (drain-induced barrier lowering). The threshold voltage can be raised by reducing Vds, by increasing the source voltage, or by making the bulk potential more negative. All of these reduce the sub-threshold leakage.
This leakage increases with temperature because the threshold voltage drops as temperature rises.

Series transistors leak less because each NMOS has a smaller Vds across it, and thus less DIBL and less leakage. In addition, the intermediate node Vx settles above ground, which gives the upper transistor a negative Vgs (refer to the diagram below).

    |Vdd
   _|
 ||
 ||_
    |
    |Vx
   _|
 ||
 ||_
    |
    |gnd

But other factors also come into the picture. Raising Vt by controlling DIBL and short channel effects causes BTBT (band-to-band tunneling) to increase. Applying a reverse body bias to increase Vt also increases BTBT.
Applying a negative gate voltage to turn the transistor OFF more strongly causes GIDL (gate-induced drain leakage) to increase.

** Gate leakage - carriers tunnel through the thin gate dielectric. It is a strong function of dielectric thickness.
Gate leakage can be alleviated by stacking transistors such that the OFF transistor is closer to the rail.

Why always a P substrate?
- Both P-well and N-well CMOS processes exist. The N-well process, which is built on a P substrate, offers a slightly better NMOS transistor and allows the use of a grounded substrate.

NAND vs NOR
- For the same delay and drive current, NOR occupies more area: its series PMOS transistors must be double the size of the transistors in a NAND to make up for the lower hole mobility.

-Transistor sizes (2-input gates):
        NAND                 NOR
PMOS    2  2 (parallel)      4, 4 (series)
NMOS    2, 2 (series)        1  1 (parallel)


-NAND uses transistors of similar sizes.
-NAND offers less delay
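The sizing argument can be put into numbers with logical effort. This is a minimal sketch assuming a PMOS/NMOS mobility ratio of 2 and a unit inverter with NMOS width 1 and PMOS width 2; the function names are illustrative.

```python
from fractions import Fraction


def gate_widths(kind: str, n_inputs: int, mobility_ratio: int = 2) -> dict:
    """Per-transistor widths that match the drive of a unit inverter."""
    if kind == "nand":
        # Series NMOS are widened by n; parallel PMOS keep the inverter size.
        return {"pmos": mobility_ratio, "nmos": n_inputs}
    if kind == "nor":
        # Series PMOS are widened by n; parallel NMOS keep the inverter size.
        return {"pmos": mobility_ratio * n_inputs, "nmos": 1}
    raise ValueError(f"unknown gate kind: {kind}")


def logical_effort(kind: str, n_inputs: int, mobility_ratio: int = 2) -> Fraction:
    """Input capacitance of one gate input relative to the unit inverter."""
    w = gate_widths(kind, n_inputs, mobility_ratio)
    return Fraction(w["pmos"] + w["nmos"], mobility_ratio + 1)


def total_width(kind: str, n_inputs: int, mobility_ratio: int = 2) -> int:
    """Sum of all transistor widths, a rough proxy for area."""
    w = gate_widths(kind, n_inputs, mobility_ratio)
    return n_inputs * (w["pmos"] + w["nmos"])
```

Under these assumptions a 2-input NAND has logical effort 4/3 and total width 8, while a 2-input NOR has logical effort 5/3 and total width 10, which is the area and delay penalty of the NOR.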

Two input mux using 4 NOR gates (assume inputs are available for negative signals too)
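One possible construction, sketched and checked in software: two NOR gates form the product terms from the complemented inputs, a third combines them, and a fourth acts as an inverter. The signal names are made up for the example, and other arrangements are possible.

```python
def nor(a: int, b: int) -> int:
    """2-input NOR on 0/1 values."""
    return int(not (a or b))


def mux2_nor(a: int, a_n: int, b: int, b_n: int, sel: int, sel_n: int) -> int:
    """2:1 mux from four NOR gates: output is a when sel == 0, b when sel == 1.

    a_n, b_n and sel_n are the complements of a, b and sel, assumed available.
    The true inputs a and b are accepted for completeness but not needed.
    """
    pick_a = nor(a_n, sel)        # a AND (NOT sel)
    pick_b = nor(b_n, sel_n)      # b AND sel
    out_n = nor(pick_a, pick_b)   # NOT (pick_a OR pick_b)
    return nor(out_n, out_n)      # NOR with tied inputs works as an inverter
```

Checking all eight input combinations confirms the truth table of a 2:1 mux.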

VLSI

I am writing this blog with the intention of consolidating most of the information required for interview preparation.

First post is for VLSI

Useful links:
http://www.asic-world.com/
http://asic-soc.blogspot.in/


Why are power stripes routed in the top metal layers?
- The top metal layers have lower resistance, so less IR drop is seen in the power distribution network. Routing power stripes in the lower metal layers would also use up a good amount of the lower routing resources and could create routing congestion.

Definitions-
Technology files
- these files provide information regarding the type of silicon wafer used, standard cells used, layout rules (DRC) etc.

Synthesis
- It converts the RTL design coded in VHDL or Verilog into a gate-level description that the next set of tools can read.
The gate-level description contains information about the cells used, their interconnections, area used and other details.

Placement
- Before the start of placement, all wire load models (WLM) are removed. Placement uses RC values from the virtual route (VR) to calculate timing. VR is the shortest Manhattan distance between two pins. VR RCs are more accurate than WLM RCs.

Pre-placement optimization
- Optimizes the netlist before placement: HFNs (high fanout nets) are collapsed, and cells can be downsized.

In-placement optimization
- Uses Manhattan distance (virtual route, VR). Does cell sizing, cell moving, cell bypassing, net splitting, gate duplication, buffer insertion and area recovery. Fixes setup and performs incremental timing and congestion optimization.

Post-placement before CTS
- Netlist optimization with ideal clocks. Fixes setup/hold and transition/capacitance violations. Placement optimization based on global routing. Redoes HFN synthesis.

Post placement after CTS
- optimizes timing with propagated clock. Tries to preserve clock skew.

Clock tree synthesis
- goal is to minimize skew and insertion delay.

Useful skew
- If clock is skewed intentionally to improve setup slack then it is known as useful skew.
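A minimal sketch of why intentional skew can help setup, using the usual single-cycle setup and hold relations. Skew is taken here as capture clock arrival minus launch clock arrival; all names and numbers are illustrative.

```python
def setup_slack(period, skew, t_clk_q, t_logic_max, t_setup):
    """Setup slack: required time (period + skew) minus latest data arrival."""
    return (period + skew) - (t_clk_q + t_logic_max + t_setup)


def hold_slack(skew, t_clk_q, t_logic_min, t_hold):
    """Hold slack: earliest data arrival minus (hold time + skew)."""
    return (t_clk_q + t_logic_min) - (t_hold + skew)


# Assumed path: 1.0 ns period, 0.1 ns clk-to-q, 0.9 ns worst logic, 0.05 ns setup.
no_skew = setup_slack(1.0, 0.0, 0.1, 0.9, 0.05)      # about -0.05 ns, a violation
useful_skew = setup_slack(1.0, 0.1, 0.1, 0.9, 0.05)  # about +0.05 ns, met
hold_after = hold_slack(0.1, 0.1, 0.2, 0.05)         # about 0.15 ns, hold margin shrinks
```

Delaying the capture clock by 0.1 ns fixes the setup violation but eats into the hold margin of the same path and into the setup margin of the next stage, so useful skew is a trade-off.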

CTO (clock tree optimization)
- The clock is shielded so that its noise is not coupled to other signals. Shielding increases area by 12-15%.
Because the clock is global in nature, the same metal layer used for power routing is also used for the clock.
After CTS, hold slack is worked on and improved. As a result of CTS, a lot of buffers are added.

Routing
- Global routing - allocates the routing resources that are used for connections.
- Detailed routing - assigns routes to specific metal layers and routing tracks within the global routing resources.

Physical verification
- DRC - complies with the technology requirements.
- LVS - layout vs schematic
- antenna effects - Antenna rule checking
- density verification at full chip level.
- ERC - complies with electrical requirements.

.lib file
ASCII representation of the timing and power parameters associated with any cell in a particular semiconductor technology.
--
Cell description - function, timing, power etc. - strength, area, leakage power.
Max capacitance for the output pin.
Output pin timing - rise delay, fall delay, rise transition, fall transition relative to a related input pin.
Cell delay table indexed by input transition and output capacitance.
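A cell delay table of this kind is typically read by interpolating between the four surrounding entries. The sketch below uses bilinear interpolation with made-up axis values and delays; real tools may use a different interpolation scheme.

```python
from bisect import bisect_right


def _bracket(axis, value):
    """Index i of the axis segment [axis[i], axis[i+1]] used for value."""
    i = bisect_right(axis, value) - 1
    return max(0, min(i, len(axis) - 2))   # clamp to the first/last segment


def lookup_delay(table, slews, loads, slew, load):
    """Bilinear interpolation in a delay table indexed as table[slew][load]."""
    i = _bracket(slews, slew)
    j = _bracket(loads, load)
    ts = (slew - slews[i]) / (slews[i + 1] - slews[i])
    tl = (load - loads[j]) / (loads[j + 1] - loads[j])
    # Interpolate along the load axis on both slew rows, then between rows.
    d0 = table[i][j] + tl * (table[i][j + 1] - table[i][j])
    d1 = table[i + 1][j] + tl * (table[i + 1][j + 1] - table[i + 1][j])
    return d0 + ts * (d1 - d0)


# Hypothetical 2x2 table: input transition in ns, output capacitance in fF.
SLEWS = [0.1, 0.5]
LOADS = [1.0, 5.0]
DELAYS = [[0.10, 0.30],
          [0.20, 0.40]]
mid_delay = lookup_delay(DELAYS, SLEWS, LOADS, 0.3, 3.0)
```

Points outside the table range are extrapolated linearly from the nearest segment in this sketch.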

Wire Load model
- it contains information that synthesis tool utilizes to estimate interconnect wiring delays during logic synthesis phase of the design.

Operating conditions
- Environmental variations of the IC: process, voltage and temperature (PVT).
A set of PVT values is called an operating condition.

Scaling factors (also called K factors) are multipliers that provide flexibility for derating the delay values based on PVT. When PVT changes by a particular amount, the K factors are used to recalculate parameters such as cell delay or net delay.
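One common way to apply such factors is a product of linear terms, one per PVT axis. The sketch below assumes that form; the K values and deltas are invented for illustration and do not come from any real library.

```python
def derate_delay(nominal, k_process, d_process, k_volt, d_volt, k_temp, d_temp):
    """Scale a nominal delay by (1 + K * delta) for process, voltage and temperature."""
    return (
        nominal
        * (1 + k_process * d_process)
        * (1 + k_volt * d_volt)
        * (1 + k_temp * d_temp)
    )


# Assumed example: 0.10 ns nominal, supply 0.1 V lower, temperature 100 C higher.
derated = derate_delay(
    nominal=0.10,
    k_process=0.0, d_process=0.0,
    k_volt=-0.5, d_volt=-0.1,     # lower voltage makes the cell slower
    k_temp=0.002, d_temp=100.0,   # higher temperature makes the cell slower
)
```

With these made-up numbers the delay grows from 0.10 ns to about 0.126 ns.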

Library level attributes - these describe the technology type, date and revision. They also give the voltage, capacitance, current and time units used in the library.

 /* General Attributes */
  technology                        (cmos);
  delay_model                     : table_lookup;
  in_place_swap_mode              : match_footprint;
  library_features                  (report_delay_calculation,report_power_calculation);

  /* Units Attributes */
  time_unit                       : "1ns";
  leakage_power_unit              : "1nW";
  voltage_unit                    : "1V";
  current_unit                    : "1mA";
  pulling_resistance_unit         : "1kohm";
  capacitive_load_unit              (1,fF);

-------------------------------------------
How are delays characterized using a WLM (Wire Load Model)?
- For a given wire load model, the delay is estimated based on the fanout of the cell driving the net.
Fanout vs net length is tabulated in the WLM.
Values of unit resistance R and unit capacitance C are given in the technology file.
Net length varies with the fanout number.
Once the net length is known, the delay can be calculated; sometimes it is tabulated as well.
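The steps above can be sketched as a small calculation. The fanout table, slope, unit R and C values and the lumped RC delay model are all assumptions for the example; real tools offer several tree models.

```python
# Hypothetical wire load model: fanout -> estimated net length (arbitrary units).
FANOUT_LENGTH = {1: 10.0, 2: 25.0, 3: 45.0, 4: 70.0}
SLOPE = 30.0            # extra length per fanout beyond the last table entry
R_PER_LENGTH = 0.001    # kohm per unit length (assumed)
C_PER_LENGTH = 0.002    # pF per unit length (assumed)


def net_length(fanout: int) -> float:
    """Look up the net length, extrapolating with SLOPE past the table."""
    if fanout < 1:
        raise ValueError("fanout must be at least 1")
    if fanout in FANOUT_LENGTH:
        return FANOUT_LENGTH[fanout]
    max_fanout = max(FANOUT_LENGTH)
    return FANOUT_LENGTH[max_fanout] + SLOPE * (fanout - max_fanout)


def net_delay_ns(fanout: int, pin_cap_pf: float) -> float:
    """Lumped RC estimate: wire resistance times (wire cap + all pin caps)."""
    length = net_length(fanout)
    r_wire = R_PER_LENGTH * length      # kohm
    c_wire = C_PER_LENGTH * length      # pF
    return r_wire * (c_wire + fanout * pin_cap_pf)   # kohm * pF = ns


delay = net_delay_ns(3, 0.01)   # fanout of 3, 0.01 pF per load pin
```

For a fanout of 3 this gives a length of 45, R = 0.045 kohm, C = 0.09 pF and a delay of about 0.0054 ns.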

STA
Source latency:
- The delay from the clock origin point to the clock definition point in the design.
It is the insertion delay external to the circuit which we are timing. It applies to only primary clocks.

Network Latency:
- The delay from the clock definition point to the clock pin of the register
It is the internal delay for the circuit which we are timing (the delay of the clock tree from the source of the clock to all of the clock sinks).

I/O latency
- If the flop of the block is talking with another flop outside the block, clock latency (network) of that flop will be the i/o latency of the block.

Global Skew
- difference between the max insertion delay and min insertion delay on any flops.
- It is also defined as the difference between the shortest clock path and longest clock path delay reaching two sequential elements.

Boundary skew
- It is defined as the difference between the max insertion delay and min insertion delay of boundary flops.
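Both skew definitions reduce to a max minus min over a set of insertion delays. In the sketch below the flop names, delays and the choice of boundary flops are invented for illustration.

```python
def skew(insertion_delays):
    """Difference between the largest and smallest clock insertion delay."""
    delays = list(insertion_delays)
    return max(delays) - min(delays)


# Hypothetical clock insertion delays in ns for four flops.
INSERTION = {"ff_a": 1.20, "ff_b": 1.35, "ff_c": 1.10, "ff_io": 1.30}
BOUNDARY_FLOPS = {"ff_a", "ff_io"}   # flops that talk to logic outside the block

global_skew = skew(INSERTION.values())                          # about 0.25 ns
boundary_skew = skew(INSERTION[f] for f in BOUNDARY_FLOPS)      # about 0.10 ns
```

Global skew looks at every flop in the design, while boundary skew restricts the same calculation to the flops at the block boundary.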

Recovery and Removal
- Timing checks for asynchronous signals
Recovery time is the minimum amount of time required between the release of an async signal and the next active clock edge.
Removal time - minimum amount of time between an active clock edge and the release of an async control signal.

--------------------------------------------

Latch vs Flip-Flop

FF - edge sensitive
   - slower
   - immune to glitches

Latch - level sensitive
      - faster
      - sensitive to glitches
      - takes fewer gates to implement than a flip-flop
      - latches facilitate time borrowing or cycle stealing,
        while flip-flops allow synchronous logic
      - latches are not friendly with DFT tools

Lock-up latch
- It is an important element in scan-based designs, especially for hold timing closure of shift modes. Lock-up latches are necessary to avoid skew problems during shift phase of scan-based testing.

cross-clock domains
http://www.eetimes.com/document.asp?doc_id=1276114

Back Annotation

- This term is generally used in connection with netlist simulations and STA, where the propagation delay(s)
through each cell in the netlist are overridden by the delay value(s) specified in a special file
called an SDF (Standard Delay Format) file. The process of applying delays from a given source to the
cells in a netlist during netlist simulation is called back annotation. Normally the values of the
delays for each cell in the netlist would come from the simulation library, i.e. the Verilog
model of the library cells. But those are not the actual delays of the cells, as each cell is instantiated
in the netlist in different surroundings: different physical locations, different loads, different fan-in.
The delays of two similar cells at two different physical locations in a chip can differ significantly
depending on these factors. Therefore, in order to have actual delays for the cells
in your netlist, an SDF is written out by an EDA tool (a synthesis tool, a layout tool, etc.).
It contains the delay of each instance of each library cell in the netlist, under the conditions that instance is in.
During simulation or static timing analysis, each cell in the netlist gets its corresponding delay read, or more
technically 'annotated', from the SDF file.

The SDF file contains the delay value of each timing arc
of each cell in the netlist. These delay values are extracted
under a given condition of the netlist. The SDF may correspond to a netlist just after
synthesis, with wire loads estimated according to some wire load model, or it may
correspond to a netlist that has been laid out, with the actual position of each cell,
the actual load on each cell and the actual metal wires connected to the cells.

SDF vs .lib

The .lib only has the cell delays in table form, and the SPEF file has the interconnect parasitics. The SDF file combines this information into a file that has accurate delays for each component in the layout database, for the given constraints. It is used along with the netlist in a simulator to verify that the design meets its functional and timing requirements.

What is temperature inversion?
- A change in temperature affects both Vt and carrier mobility in a transistor: as temperature rises, mobility drops (which slows the device) and Vt drops (which speeds it up). At technology nodes of 65nm and above, the mobility effect dominates because the supply voltage is high enough that the change in Vt hardly matters. So, at 65nm and above, the delay of the device increases as temperature rises. Below 65nm, the Vt effect dominates because the supply voltage is lower, so the delay decreases as temperature rises. This is temperature inversion.

Recovery
- minimum amount of time required between the release of an async signal from the active state to the next active clock edge.

Removal
- minimum amount of time between the active clock edge and the release of an asynchronous control signal.