hardware implementation of data compression ... - INFN Bologna
UNIVERSITÀ DEGLI STUDI DI BOLOGNA
FACOLTÀ DI SCIENZE MATEMATICHE FISICHE E NATURALI
DOTTORATO DI RICERCA IN FISICA, XIV ciclo

HARDWARE IMPLEMENTATION OF
DATA COMPRESSION ALGORITHMS
IN THE ALICE EXPERIMENT

Doctoral thesis by:
Dott. Davide Falchieri

Academic Year 2000/2001

Advisors:
Prof. Maurizio Basile
Prof. Enzo Gandolfi

Coordinator:
Prof. Giovanni Venturi
Keywords: ALICE, data compression, CARLOS, wavelets, VHDL
Contents

Introduction

1 The ALICE experiment
  1.1 The Inner Tracking System
    1.1.1 Tracking in ALICE
    1.1.2 Physics of the ITS
    1.1.3 Layout of the ITS
  1.2 Design of the drift layers
  1.3 The SDDs (Silicon Drift Detectors)
  1.4 SDD readout system
    1.4.1 Front-end module
    1.4.2 Event-buffer strategy
    1.4.3 End-ladder module
    1.4.4 Choice of the technology

2 Data compression techniques
  2.1 Applications of data compression
  2.2 Remarks on information theory
  2.3 Compression techniques
    2.3.1 Lossless compression
    2.3.2 Lossy compression
    2.3.3 Measures of performance
    2.3.4 Modelling and coding
  2.4 Lossless compression techniques
    2.4.1 Huffman coding
    2.4.2 Run-length encoding
    2.4.3 Differential encoding
    2.4.4 Dictionary techniques
    2.4.5 Selective readout
  2.5 Lossy compression techniques
    2.5.1 Zero suppression
    2.5.2 Transform coding
    2.5.3 Subband coding
    2.5.4 Wavelets
  2.6 Implementation of compression algorithms

3 1D compression algorithm and implementations
  3.1 Compression algorithms for SDD
  3.2 1D compression algorithm
  3.3 1D algorithm performance
    3.3.1 Compression coefficient
    3.3.2 Reconstruction error
  3.4 CARLOS v1
    3.4.1 Board description
    3.4.2 CARLOS v1 design flow
    3.4.3 Functions performed by CARLOS v1
    3.4.4 Tests performed on CARLOS v1
  3.5 CARLOS v2
    3.5.1 The firstcheck block
    3.5.2 The barrel shifter block
    3.5.3 The fifo block
    3.5.4 The event-counter block
    3.5.5 The outmux block
    3.5.6 The feesiu (toplevel) block
    3.5.7 CARLOS-SIU interface
  3.6 CARLOS v2 design flow
  3.7 Tests performed on CARLOS v2

4 2D compression algorithm and implementation
  4.1 2D compression algorithm
    4.1.1 Introduction
    4.1.2 How the 2D algorithm works
    4.1.3 Compression coefficient
    4.1.4 Reconstruction error
  4.2 CARLOS v3 vs. the previous prototypes
  4.3 The final readout architecture
  4.4 CARLOS v3
  4.5 CARLOS v3 building blocks
    4.5.1 The channel block
    4.5.2 The encoder block
    4.5.3 The barrel15 block
    4.5.4 The fifonew32x15 block
    4.5.5 The channel-trigger block
    4.5.6 The ttc-rx-interface block
    4.5.7 The fifo-trigger block
    4.5.8 The event-counter block
    4.5.9 The outmux block
    4.5.10 The trigger-interface block
    4.5.11 The cmcu block
    4.5.12 The pattern-generator block
    4.5.13 The signature-maker block
  4.6 Digital design flow for CARLOS v3
  4.7 CARLOS layout features

5 Wavelet-based compression algorithm
  5.1 Wavelet-based compression algorithm
    5.1.1 Configuration parameters of the multiresolution algorithm
  5.2 Multiresolution algorithm optimization
    5.2.1 The Wavelet Toolbox from Matlab
    5.2.2 Choice of the filters
    5.2.3 Choice of the dimensionality, number of levels and threshold value
  5.3 Choice of the architecture
    5.3.1 Simulink and the Fixed-Point Blockset
    5.3.2 Choice of the architecture
  5.4 Multiresolution algorithm performance
  5.5 Hardware implementation

Conclusions

Bibliography
Introduction

This thesis work has been aimed at the hardware implementation of data compression algorithms to be applied to High Energy Physics experiments. The amount of data that will be produced by the LHC experiments at CERN is of the order of 1 GByte/s. Cost constraints on magnetic tapes and on the data acquisition systems (optical fibres, readout boards) require on-line data compression to be applied in the front-end electronics of the different detectors. This calls for compression algorithms that achieve a high compression ratio while keeping the reconstruction error low: in fact, a high compression coefficient can only be achieved at the expense of some loss on the physics data.
The thesis describes the hardware implementation of compression algorithms applied to the ALICE experiment, in particular to the SDD (Silicon Drift Detector) readout chain. The total amount of data produced by the SDDs is 32.5 MBytes per event, while the space reserved on magnetic tapes for permanent storage is 1.5 MBytes: the compression coefficient therefore has to be at least 22. Besides, since the p-p interaction rate is 1000 Hz, the data compression hardware has to complete its job within 1 ms. This leads to the search for compression algorithms with high performance in terms of both compression ratio and execution speed.
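These two requirements amount to simple arithmetic; the sketch below merely restates the numbers quoted above, without introducing any new detector parameters:

```python
# Back-of-the-envelope check of the SDD compression requirements.
raw_mbytes = 32.5      # data produced by the SDDs per event
stored_mbytes = 1.5    # space reserved on tape per event

ratio = raw_mbytes / stored_mbytes
print(f"required compression coefficient: {ratio:.1f}")   # 21.7 -> at least 22

pp_rate_hz = 1000      # p-p interaction rate
budget_ms = 1000 / pp_rate_hz
print(f"time budget per event: {budget_ms:.0f} ms")       # 1 ms
```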
The thesis describes the design and implementation of three prototypes of the ASIC CARLOS (Compression And Run Length encOding Subsystem), which deals with the on-line data compression, packing and transmission to the standard ALICE data acquisition system. CARLOS v1 and v2 contain a one-dimensional compression algorithm based on thresholding, run-length encoding, differential encoding and Huffman coding techniques. CARLOS v3 was meant to contain a two-dimensional compression algorithm that obtains a better compression ratio than the 1D one with a lower physics data loss. Nevertheless, for time reasons, the CARLOS v3 design sent to the foundry contains a simple 1D look-up table based compression algorithm. The 2D algorithm will be implemented in the next prototype, which should be the final version of CARLOS. The first two prototypes have been tested with good results; the third one is currently being fabricated and its tests will begin in February 2002.
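As a toy illustration of two of the techniques just named (thresholding with run-length encoding of the empty stretches, plus differential encoding of the surviving samples), and emphatically not the actual CARLOS algorithm, a single compression pass might be sketched as:

```python
# Toy sketch: suppress samples below threshold, run-length encode the
# suppressed stretches, differentially encode each surviving burst.
# Illustration of the named techniques only, not the CARLOS algorithm.
def compress(samples, threshold):
    out = []          # list of (zero_run_length, first_value, diffs)
    i = 0
    while i < len(samples):
        run = 0
        while i < len(samples) and samples[i] < threshold:
            run += 1                      # count suppressed samples
            i += 1
        burst = []
        while i < len(samples) and samples[i] >= threshold:
            burst.append(samples[i])      # collect an above-threshold burst
            i += 1
        if burst:
            diffs = [b - a for a, b in zip(burst, burst[1:])]
            out.append((run, burst[0], diffs))
        elif run:
            out.append((run, None, []))   # trailing suppressed stretch
    return out

print(compress([0, 1, 0, 40, 42, 41, 0, 0, 35, 0], 10))
# -> [(3, 40, [2, -1]), (2, 35, []), (1, None, [])]
```

In a real implementation the run lengths, first values and differences would then be entropy coded, e.g. with the Huffman tables mentioned above.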
Besides, the thesis contains a detailed study of a wavelet-based compression algorithm, which obtains encouraging results in terms of both compression ratio and reconstruction error. The algorithm may find a suitable application as a second-level compressor on SDD data, should it become necessary to switch off the compression algorithm implemented on CARLOS.
The thesis is structured in the following way:

• Chapter 1 contains a description of the ALICE experiment, focusing on the SDD readout architecture.

• Chapter 2 contains an introduction to standard compression algorithms.

• Chapter 3 contains a description of the 1D algorithm developed at the INFN Section of Torino and of the two prototypes CARLOS v1 and v2.

• Chapter 4 focuses on the 2D compression algorithm and on the design and implementation of the prototype CARLOS v3.

• Chapter 5 contains a description of a wavelet-based compression algorithm, especially tuned to reach high performance on SDD data, and of its possible application as a second-level compressor in the counting room.
Chapter 1

The ALICE experiment

ALICE (A Large Ion Collider Experiment) [1] is an experiment at the Large Hadron Collider (LHC) [2] optimized for the study of heavy-ion collisions at a centre-of-mass energy of 5.5 TeV per nucleon pair. The main aim of the experiment is to study in detail the behaviour of nuclear matter at high densities and temperatures, in view of probing deconfinement and chiral symmetry restoration.
The detector [1, 3] consists essentially of two main components: the central part, composed of detectors mainly devoted to the study of hadronic signals and dielectrons, and the forward muon spectrometer, devoted to the study of quarkonia behaviour in dense matter. The layout of the ALICE set-up is shown in Fig. 1.1.
A major technical challenge is imposed by the large number of particles created in the collisions of lead ions. There is a considerable spread in the currently available predictions for the multiplicity of charged particles produced in a central Pb-Pb collision. The design of the experiment has been based on the highest value, 8000 charged particles per unit of rapidity at midrapidity. This multiplicity dictates the granularity of the detectors and their optimal distance from the colliding beams. The central part, which covers ±45° (|η| ≤ 0.9) over the full azimuth, is embedded in a large magnet with a weak solenoidal field. Outside of the Inner Tracking System (ITS) there are a cylindrical TPC (Time Projection Chamber) and a large-area PID array of time-of-flight (TOF) counters. In addition, there are two small-area
single-arm detectors: an electromagnetic calorimeter (Photon Spectrometer, PHOS) and an array of RICH counters optimized for high-momentum inclusive particle identification (HMPID).

Figure 1.1: Longitudinal section of the ALICE detector
My thesis work has been focused on data coming from one of the three detectors forming the ITS, the Silicon Drift Detector (SDD).

1.1 The Inner Tracking System

The basic functions of the ITS [4] are:

• determination of the primary vertex and of the secondary vertices necessary for the reconstruction of charm and hyperon decays;

• particle identification and tracking of low-momentum particles;

• improvement of the momentum and angle measurements of the TPC.
1.1.1 Tracking in ALICE

Track finding in heavy-ion collisions at the LHC presents a big challenge because of the extremely high track density. In order to achieve a high granularity and a good two-track separation, ALICE uses three-dimensional hit information wherever feasible, with many points on each track and a weak magnetic field. The ionization density of each track is measured for particle identification. The need for a large number of points on each track has led to the choice of a TPC as the main tracking system. In spite of its drawbacks concerning speed and data volume, only this device can provide reliable performance for a large volume at up to 8000 charged particles per unit of rapidity. The minimum possible inner radius of the TPC (r_in = 90 cm) is given by the maximum acceptable hit density. The outer radius (r_out = 250 cm) is determined by the minimum length required for a dE/dx resolution better than 10%. At smaller radii, and hence larger track densities, tracking is taken over by the ITS.
The ITS consists of six cylindrical layers of silicon detectors. The number and position of the layers are optimized for efficient track finding and impact parameter resolution. In particular, the outer radius is determined by the track matching with the TPC, and the inner one is the minimum compatible with the radius of the beam pipe (3 cm). The silicon detectors feature the high granularity and excellent spatial precision required.
Because of the high particle density, up to 90 cm⁻², the four innermost layers (r ≤ 24 cm) must be truly two-dimensional devices. For this task, silicon pixel and silicon drift detectors were chosen. The outer two layers at r = 45 cm, where the track densities are below 1 cm⁻², are equipped with double-sided silicon micro-strip detectors. With the exception of the two innermost pixel planes, all layers have analog readout for particle identification via a dE/dx measurement in the non-relativistic region. This gives the Inner Tracking System a stand-alone capability as a low-p_t particle spectrometer.
1.1.2 Physics of the ITS

The ITS will contribute to the track reconstruction by improving the momentum resolution obtained by the TPC. This will be beneficial for practically all the physics topics addressed by the ALICE experiment. The global event features will be studied by measuring the multiplicity distributions and the inclusive particle spectra. For the study of resonance production (ρ, ω and φ) and, more importantly, of the behaviour of the mass and width of these mesons in the dense medium, the momentum resolution is even more important: we have to achieve a mass precision comparable to, or better than, the natural width of the resonances in order to observe changes of their parameters caused by chiral symmetry restoration. The mass resolution for heavy states, like D mesons, J/ψ and Υ, will also be better, thus improving the signal-to-background ratio in the measurement of open charm production and in the study of heavy-quarkonia suppression. Improved momentum resolution will also enhance the performance in the observation of another hard phenomenon: jet production and the predicted jet quenching, i.e. the energy loss of partons in strongly interacting dense matter.
The low-momentum particles (below 100 MeV/c) will be detectable only by the ITS. This is of interest in itself, because it widens the momentum range for the measurement of particle spectra, which allows collective effects associated with large length scales to be studied. In addition, a low-p_t cut-off is essential to suppress the soft gamma conversions and the background in the electron-pair spectrum due to Dalitz pairs. The PID capabilities of the ITS in the non-relativistic (1/β²) region will therefore also be of great help.
In addition to the improved momentum resolution, which is necessary for identical-particle interferometry, especially at low momenta, the ITS will contribute to this study through an excellent double-hit resolution, enabling the separation of tracks with close momenta. In order to be able to study particle correlations in the three components of their relative momenta, and hence to get information about the space-time evolution of the system produced in heavy-ion collisions at the LHC, we need sufficient angular resolution in the measurement of the particle's direction. Two of the three components of the relative momentum (the side and longitudinal ones) depend crucially on the precision with which the particle direction is known. The angular resolution is determined by the precise ITS measurements of the primary vertex position and of the first points on the tracks. The particle identification at low momenta will enhance the physics capability by allowing the interferometry of individual particle species as well as the study of non-identical particle correlations, the latter giving access to the emission times of different particles.
The study of strangeness production is an essential part of the ALICE physics program. It will allow the level of chemical equilibration and the density of strange quarks in the system to be established. The measurement will be performed by charged-kaon identification and hyperon detection, based on the ITS capability to recognize secondary vertices. The observation of multi-strange hyperons (Ξ⁻ and Ω⁻) is of particular interest, because they are unlikely to be produced during the hadronic rescattering, due to the high energy threshold for their production. In this way we can obtain information about the strangeness density of the early stage of the collision.
Open charm production in heavy-ion collisions is of great physics interest. Charm quarks can be produced in the initial hard parton scattering and, after that, only in the very early stages of the collision, while the energy in parton rescattering is still above the charm production threshold. The charm yield is not altered later. The excellent performance of the ITS in finding secondary vertices close to the interaction point gives us the possibility to detect D mesons by reconstructing the full decay topology.
Figure 1.2: ITS layers

1.1.3 Layout of the ITS

A general view of the ITS is shown in Fig. 1.2. The system consists of six cylindrical layers of coordinate-sensitive detectors, covering the central rapidity region (|η| ≤ 0.9) for vertices located within the length of the interaction diamond (2σ), i.e. 10.6 cm along the beam direction (z). The detectors and front-end electronics are held by lightweight carbon-fibre structures. The geometrical dimensions and the main features of the various layers of the ITS are summarized in Table 1.1.
The granularity required for the innermost planes is achieved with silicon micro-pattern detectors with true two-dimensional readout: Silicon Pixel Detectors (SPD) and Silicon Drift Detectors (SDD). At larger radii the requirements in terms of granularity are less stringent, therefore double-sided Silicon Strip Detectors (SSD) with a small stereo angle are used. Double-sided microstrips have been selected rather than single-sided ones because they introduce less material in the active volume. In addition, they offer the possibility to correlate the pulse heights read out from the two sides, thus helping to resolve ambiguities inherent in the use of detectors with projective readout. The main parameters of the three detector types (spatial precision, two-track resolution, cell size, number of channels per module, total number of electronic channels) are shown in Table 1.1.
Parameter                              Pixel       Drift        Strip
Spatial precision rφ (µm)              12          38           20
Spatial precision z (µm)               70          28           830
Two-track resolution rφ (µm)           100         200          300
Two-track resolution z (µm)            600         600          2400
Cell size (µm²)                        50 × 300    150 × 300    95 × 40000
Active area (mm²)                      13.8 × 82   72.5 × 75.3  73 × 40
Readout channels per module            65536       2 × 256      2 × 768
Total number of modules                240         260          1770
Total readout channels (k)             15729       133          2719
Total number of cells (M)              15.7        34           2.7
Average occupancy, inner layer (%)     1.5         2.5          4
Average occupancy, outer layer (%)     0.4         1.0          3.3

Table 1.1: Main features of the ITS detectors
The large number of channels in the layers of the ITS requires a large number of connections from the front-end electronics to the detector and to the data acquisition system. The requirement for a minimum of material within the acceptance does not allow the use of conventional copper cables near the active surfaces of the detection system. Therefore Tape Automated Bonding (TAB) aluminium multilayer microcables are used.
The detectors and their front-end electronics produce a large amount of heat, which has to be removed while keeping a very high degree of temperature stability. In particular, the SDDs are sensitive to temperature variations in the 0.1 °C range. For these reasons, particular care was taken in the design of the cooling system and of the temperature monitoring. A water cooling system at room temperature is the chosen solution for all ITS layers, but the use of other liquid coolants is still being considered. For the temperature monitoring, dedicated integrated circuits are mounted on the readout boards and specific calibration devices are integrated in the SDDs.
Figure 1.3: SDD prototype: 1) active area, 2) guard area.

The outer four layers of the ITS detectors are assembled onto a mechanical structure made of two end-cap cones connected by a cylinder placed between the SSD and the SDD layers. Both the cones and the cylinder are made of lightweight sandwiches of carbon-fibre plies and Rohacell™. The carbon-fibre structure also includes the appropriate mechanical links to the TPC and to the SPD layers. The latter are assembled in two half-cylinder structures, specifically designed for safe installation around the beam pipe. The end-cap cones provide the cabling and cooling connections of the six ITS layers with the outside services.
1.2 Design of the drift layers

SDDs (a picture is shown in Fig. 1.3) have been selected to equip the two intermediate layers of the ITS, since they couple a very good multi-track capability with dE/dx information. At least three measured samples per track, and therefore at least four layers carrying dE/dx information, are needed. The SDDs, with a 7.25 × 7.53 cm² active area each, will be mounted on linear structures called ladders, each holding six detectors for layer 3 and eight detectors for layer 4 (see Fig. 1.4).

Figure 1.4: Longitudinal section of ITS layers 3 and 4
The layers will sit at average radii of 14.9 and 23.8 cm from the beam pipe and will be composed of 14 and 22 ladders, respectively. The front-end electronics will be mounted on rigid heat-exchanging hybrids, which in turn will be connected to cooling pipes running along the ladder structure. The connections between the detectors and the front-end electronics, and between both and the ends of the ladders, will be made with TAB-bonded flexible Al microcables, which will carry both data and power supply lines. Each detector will be first assembled together with its front-end electronics and high-voltage connections as
a unit, hereafter called a module, which will be fully tested before it is mounted on the ladder.

Figure 1.5: Working mode of an SDD detector
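As a consistency check, the ladder geometry just described reproduces the SDD module count of Table 1.1 and, under assumed readout parameters (256 time samples per anode and one byte per sample, neither of which is taken from the text), the 32.5 MBytes/event figure quoted in the Introduction:

```python
# Modules per layer from the ladder geometry described above.
ladders = {3: 14, 4: 22}              # ladders on layers 3 and 4
dets_per_ladder = {3: 6, 4: 8}        # detectors (modules) per ladder
modules = sum(ladders[l] * dets_per_ladder[l] for l in (3, 4))
print(modules)                        # 260, as in Table 1.1

# Raw event size, assuming 256 time samples per anode and one byte
# per sample (illustrative assumptions, not detector parameters
# stated in this chapter).
anodes_per_module = 2 * 256           # readout channels per module (Table 1.1)
raw_bytes = modules * anodes_per_module * 256 * 1
print(raw_bytes / 2**20)              # 32.5 (MBytes per event)
```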
1.3 The SDDs (Silicon Drift Detectors)

SDDs, like gaseous drift detectors, exploit the measurement of the transport time of the charge deposited by a traversing particle to localize the impact point in two dimensions, thus enhancing resolution and multi-track capability at the expense of speed. They are therefore well suited to this experiment, in which very high particle multiplicities are coupled with relatively low event rates (up to some kHz). A linear SDD, shown schematically in Fig. 1.5, has a series of parallel implanted p⁺ field strips, connected to a voltage divider, on both surfaces of the high-resistivity n-type silicon wafer. The voltage divider is integrated on the detector substrate itself. The field strips provide the bias voltage to fully deplete the volume of the detector and generate an electrostatic field parallel to the wafer surface, thus creating a drift region (see Fig. 1.6). Electron-hole pairs are created by the charged particles crossing the detector. The holes are collected by the nearest p⁺ electrode, while the electrons are focused into the middle plane of the detector and driven by the drift field towards the edge of the detector,
Figure 1.6: Potential energy of electrons (negative electric potential) on the y-z plane of the device
where they are collected by an array of anodes composed of n+ pads. The electron charge cloud thus drifts from the impact point to the anode region: the cloud has a bell-shaped Gaussian distribution that, owing to diffusion and mutual repulsion, becomes lower and broader during the drift [5] (see Fig. 1.7). In this way a charge cloud can be collected by one or more anodes, depending on the charge released by the ionizing particle and on the impact position with respect to the anode region. The small size of the anodes, and hence their small capacitance (50 fF), implies low noise and good energy resolution. The coordinate perpendicular to the drift direction is given by the centroid of the collected charge. The coordinate along the drift direction is measured by the centroid of the signal in the time domain, taking into account the amplifier response. A spatial precision, averaged over the full detector surface, better than 40 µm in both coordinates has been obtained during beam tests of full-size prototype detectors. Each SDD module is divided into two half-detectors: each half-detector contains on its external side 256 anodes with a pitch of 300 µm. Thus each SDD detector contains 2 × 256 readout channels; taking into account that layers 3 and 4 contain 260 SDD modules, the total number of SDD readout channels is around 133k.
Figure 1.7: Charge distribution evolution scheme (anode axis vs. time axis)
1.4 SDD readout system

The requirements for the SDD readout system derive both from the features of the detector and from the ALICE experiment in general. The following points are crucial in the definition of the final readout system:

– The signal generated by the SDD is a Gaussian-shaped current signal, with variable sigma and charge (5-30 ns and 4 to 32 fC), which can be collected by one or more anodes. The front-end electronics should therefore be able to handle analog signals over a wide dynamic range; at the same time, the system noise should be very low while large signals can still be handled.

– The amount of data generated by the SDD is very large: each half-detector has 256 anodes, and for each anode 256 time samples have to be taken in order to cover the full drift length.

– The small space available on the ladder and the constraints on material impose an architecture which minimizes cabling.

– The radiation environment in which the front-end electronics has to work imposes the choice of a radiation-tolerant technological
Figure 1.8: SDD ladder electronics (front-end modules with PASCAL and AMBRA on the ladder; end-ladder module with CARLOS and the SIU, plus test and slow control)
library for the implementation of the electronics.

The chosen SDD readout electronics, shown in Fig. 1.8, consists of front-end modules and end-ladder modules. The front-end module performs analog data acquisition, A/D conversion and buffering, while the end-ladder module contains high-voltage and low-voltage regulators and a chip for data compression and for interfacing the ALICE DAQ system.
Figure 1.9: The front-end readout unit
1.4.1 Front-end module

The front-end modules, one per half-detector, are distributed along the ladders together with the SDD modules. Each front-end module contains four chip pairs, PASCAL (Preamplifier, Analog Storage and Conversion from Analog to digitaL) plus AMBRA (A Multievent Buffer Readout Architecture), as shown in Fig. 1.9. The PASCAL chips are TAB-bonded directly to the SDD output anodes, while the AMBRA chips are connected to CARLOS (Compression And Run Length encOding Subsystem) via an 8-bit bus.

Each PASCAL chip contains three functional blocks (see Fig. 1.10):

– 64 low-noise preamplifiers, one for each anode;

– an analog memory working at a 40 MHz clock frequency (64 × 256 cells);

– 64 10-bit analog-to-digital converters (ADCs), one for each channel.

During the write phase, i.e. when no trigger signal has been received, the preamplifiers continuously write the samples into the analog memory
Figure 1.10: PASCAL chip architecture (preamplifiers, analog memory with its control unit, ADCs and the interface control unit)
cells at 40 MHz, while the ADCs are in stand-by mode. When PASCAL receives a trigger signal from CARLOS (which receives it from the Central Trigger Processor, CTP), a control logic module on the PASCAL chip stops the analog memory write phase, freezes its contents and starts the read phase, performed in two steps: in the first step the ADCs are set to sample mode and the analog memory reads out the first sample of each anode row; after the memory settling time, the ADCs switch to conversion mode and the analog data are converted to digital through a successive approximation technique. When the conversion is finished, the control logic module on PASCAL starts the
Input range   Output codes      Code mapping   Bits lost
0-127         from 128 to 128   0xxxxxxx       0
128-255       from 128 to 32    100xxxxx       2
256-511       from 256 to 32    101xxxxx       3
512-1023      from 512 to 64    11xxxxxx       3

Table 1.2: Digital compression from 10 to 8 bits
readout of the next sample from the analog memory and, at the same time, sends the 64 digital words to the AMBRA chip over a 40-bit wide bus. The read phase goes on until all the analog memory content has been converted to digital values or an abort signal comes from CARLOS (again receiving it from the CTP), meaning that the event has to be discarded.

The AMBRA chip has two main functions: first, AMBRA has to compress data from 10 to 8 bits per sample; then it has to store the input data stream into a digital buffer. The principle used for compression is to decrease the resolution for larger signals with a logarithmic or square-root law, using the mapping shown in Table 1.2. Since the larger signals have a better signal-to-noise ratio than the smaller ones, the accuracy of the measurement is not affected.
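As an illustration, the mapping of Table 1.2 can be sketched in software. The range prefixes and the number of bits lost are those of the table; the exact packing of the retained bits and the mid-point reconstruction are assumptions made for this sketch, not necessarily the AMBRA implementation:

```python
def compress_10to8(x: int) -> int:
    """Compress a 10-bit sample (0-1023) to 8 bits, following Table 1.2."""
    if x < 128:                                 # 0xxxxxxx: exact, 0 bits lost
        return x
    if x < 256:                                 # 100xxxxx: 2 bits lost
        return 0b10000000 | ((x - 128) >> 2)
    if x < 512:                                 # 101xxxxx: 3 bits lost
        return 0b10100000 | ((x - 256) >> 3)
    return 0b11000000 | ((x - 512) >> 3)        # 11xxxxxx: 3 bits lost

def expand_8to10(c: int) -> int:
    """Approximate inverse: reconstruct the mid-point of each code bin."""
    if c < 0b10000000:
        return c
    if c < 0b10100000:
        return 128 + ((c & 0b11111) << 2) + 2
    if c < 0b11000000:
        return 256 + ((c & 0b11111) << 3) + 4
    return 512 + ((c & 0b111111) << 3) + 4
```

With this packing the reconstruction error is at most half a bin width: 0, 2 or 4 ADC counts depending on the input range, which is why the accuracy of large, high signal-to-noise samples is essentially preserved.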
The four AMBRA chips contain static RAMs able to hold 256 KBytes, and can thus temporarily store four complete half-SDD events (one event corresponds to 256 × 256 bytes = 64 KBytes). Read and write stages are allowed at the same time: while the PASCAL chips are transferring data to the AMBRA chips, the AMBRA chips can send data belonging to another event to the CARLOS chip. Since the four AMBRA chips have to transmit data over a single 8-bit bus, an arbitration mechanism has been implemented.
1.4.2 Event-buffer strategy
The dead time due to the SDD readout system is around 358.4 µs: this is the time needed to read one cell of the analog memory and convert it into a digital word, 1.4 µs, multiplied by the number of cells, 256. This means that a new trigger signal will not be accepted until 358.4 µs have passed after the previous event. Every 1.4 µs each detector produces 512 bytes of data, so at least ten 8-bit buses per detector working at 40 MHz would be required for data transfer. Unfortunately the space on the ladder is very limited, and managing 80 data lines per detector (for a total of 320 for the half-ladder) is a very serious problem, especially for the input connections to the end-ladder readout units.
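The figures above follow from simple arithmetic; a minimal check, with the numbers taken from the text:

```python
# Dead time: 256 analog-memory cells, 1.4 us to read and convert each one
cells, t_cell_us = 256, 1.4
dead_time_us = cells * t_cell_us            # 358.4 us

# Raw output rate: 512 bytes per detector every 1.4 us
bytes_per_step = 512
rate_MB_s = bytes_per_step / t_cell_us      # ~366 MB/s per detector

# One 8-bit bus at 40 MHz moves 40 MB/s, hence the need for at least 10 buses
bus_MB_s = 40
buses_needed = -(-rate_MB_s // bus_MB_s)    # ceiling division
```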
The adopted solution, the insertion of a digital multi-event buffer on the front-end readout unit between PASCAL and CARLOS, allows data to be sent towards the end-ladder unit at a lower speed: if another event arrives while data are being transmitted from AMBRA to CARLOS, another digital buffer on AMBRA is ready to accept the data coming from PASCAL. Data are transferred from AMBRA to CARLOS over an 8-bit bus in 1.65 ms (25 ns × 64 Kwords) while other events are processed by PASCAL and sent to AMBRA. For an average Pb-Pb event rate of 40 Hz and using a double-event digital buffer, our simulations indicate that the dead time due to buffer overrun is only 0.1% of the total time. This is the fraction of time during which AMBRA is transferring data to CARLOS while the other buffer in AMBRA is full: in this situation a BUSY signal is asserted towards the CTP, meaning that no further trigger can be accepted. In order to reach a much smaller dead time even at higher event rates, a decision was taken to make the AMBRA device four buffers deep.

In order to allow full testability of the readout electronics at the board and system levels, the ASICs embody a standard JTAG interface. In this way it is possible to test each chip after the various assembly stages and during the run phase, in order to check correct functionality.
Layer   Ladders   Detectors/ladder   Data/ladder   Total data
3       14        6                  768 KBytes    10.5 MBytes
4       22        8                  1 MByte       22 MBytes
Both                                               32.5 MBytes

Table 1.3: Total amount of data produced by SDDs
The same interface is used to download control information into the chips.

A radiation-tolerant deep-submicron process (0.25 µm) has been used for the final versions of the ASICs. These technologies are now available and allow us to reduce size and power consumption with no degradation of the signal processing speed. Moreover, it has been shown that, when specific layout techniques are used, they have a better resistance to radiation than commercially available technologies.
1.4.3 End-ladder module

The end-ladder modules are located at both ends of each ladder (two per ladder); they receive data from the front-end modules, perform data compression with the CARLOS chip and send the data to the DAQ through an optical fibre link.

Besides that, the end-ladder board will host the TTCrx device, a chip receiving the global clock and trigger signals from the CTP and distributing them to PASCAL, AMBRA and CARLOS, as well as the power regulators for the complete ladder system.
CARLOS receives 8 data streams coming from 8 half-detectors, i.e. from one half-ladder, for a total data volume of 64 KBytes × 8 = 512 KBytes, at an input rate of 320 MByte/s. Taking into account the number of ladders and detectors per ladder (see Table 1.3), the total volume of data produced by all the SDD modules amounts to around 32.5 MBytes per event, while the space reserved on disk for permanent storage is 1.5 MBytes. This calls for a compression algorithm with a compression coefficient of at least 22 and a reconstruction error as low as possible, in order to minimize the loss of physical information. Moreover, since the trigger rate in proton-proton interactions amounts to 1 kHz, each event should be compressed and sent to the DAQ system within 1 ms. Actually, thanks to the buffering provided by the AMBRA chips, this processing time doubles to 2 ms, thus relaxing the timing constraint on the CARLOS chip.

These constraints led us to the design and implementation of a first prototype of CARLOS. The desire for better compression performance, together with changes in the readout architecture due to the presence of radiation, then led us to the design and implementation of two further CARLOS prototypes. We are now going to design CARLOS v4, which is intended to be the final version of the compression ASIC. The first three prototypes of CARLOS are explained in detail in chapters 3 and 4, while chapter 2 contains a review of existing compression techniques.
1.4.4 Choice of the technology

The effects of radiation on electronic circuits can be divided into total dose effects and single event effects [6]. Total dose modifies the thresholds of MOS transistors and increases leakage currents. This is of particular concern in leakage-sensitive analog circuits, such as analog memories. For instance, assuming a value of 1 pF for the storage capacitors in the memory, a leakage current as small as 1 nA would change the value of the stored information by 0.2 V in 200 µs. This is of course unacceptable.

Radiation-tolerant layout practices prevent this risk and their use in analog circuits is therefore recommended. These design techniques become extremely effective in deep-submicron CMOS technologies. Single event effects can trigger latch-up phenomena or can change the value of digital bits (Single Event Upset, SEU). Latch-up can be prevented with the systematic use of guard rings in the layout. Single event upset can
be a problem especially when it occurs in the digital control logic, and can be prevented by layout techniques or by redundancy in the system.

Radiation-tolerant layouts of course carry area penalties. It can be estimated that, in a given technology, a minimum-size inverter with radiation-tolerant layout is 70% bigger than the corresponding inverter with standard layout. Nevertheless, a radiation-tolerant inverter in a quarter-micron technology is about eight times smaller than a standard inverter in a 0.8 µm technology. The radiation dose which will be received by the readout electronics will be quite low, below 100 krad in 10 years. This value is probably below the limit of what a standard technology can withstand; however, conservative considerations suggested the use of radiation-tolerant techniques for the critical parts of the circuit. These techniques have been proven to work up to 30 Mrad and entail a lower area penalty and lower cost compared with radiation-hard processes. Thus the library chosen for the implementation of the PASCAL, AMBRA and CARLOS chips is the 0.25 µm IBM technology with standard cells designed at CERN to be radiation tolerant.
Chapter 2

Data compression techniques

Data compression [7] is the art or science of representing information in a compact form. These compact representations are created by identifying and using structures that exist in the data. Data can be characters in a text file, numbers that are samples of speech or image waveforms, or sequences of numbers generated by physical processes. Data compression plays an important role in many fields, for example in the transmission of digital television signals. If we wanted to transmit an HDTV (High Definition TeleVision) signal without any compression, we would need to transmit about 884 Mbits/s. Using data compression, we need to transmit less than 20 Mbits/s along with audio information. Compression is now very much a part of everyday life. If you use computers, you are probably using a variety of products that make use of compression. Most modems now have compression capabilities that allow data to be transmitted many times faster than otherwise possible. File compression utilities, which let us store more on our disks, are now commonplace.

This chapter contains an introduction to data compression with a description of the most commonly used compression algorithms, with the aim of finding the compression technique best suited to the physical data coming out of the SDD.
2.1 Applications of data compression

An early example of data compression is the Morse code, developed by Samuel Morse in the mid-19th century. Letters sent by telegraph are encoded with dots and dashes. Morse noticed that certain letters occurred more often than others. In order to reduce the average time required to send a message, he assigned shorter sequences to letters that occur more frequently, such as a (· −) and e (·), and longer sequences to letters that occur less frequently, such as q (− − · −) or j (· − − −). What is being used to provide compression in the Morse code is the statistical structure of the message to be compressed, i.e. the fact that some letters occur in the message with a higher probability than others. Indeed, most compression techniques exploit the statistical structure of the input to provide compression, but this is not the only kind of structure that exists in the data.

There are many other kinds of structure, in data of different types, that can be exploited for compression. Let us take speech as an example. When we speak, the physical construction of our voice box dictates the kinds of sounds that we can produce; that is, the mechanics of speech production impose a structure on speech. Therefore, instead of transmitting the sampled speech itself, we could send information about the conformation of the voice box, which could be used by the receiver to synthesize the speech. An adequate amount of information about the conformation of the voice box can be represented much more compactly than the sampled values of the speech. This compression approach is currently being used in a number of applications, including the transmission of speech over mobile radios and the synthetic voice in toys that speak.
Data compression can also take advantage of redundant structure in the input signal, that is, structure carrying more information than needed. For example, if a sound has to be transmitted in order to be heard by a human being, all frequencies below 20 Hz and above 20 kHz can be eliminated (thus providing compression), since these frequencies cannot be perceived by humans.
2.2 Remarks on information theory

Without going into details, we just want to recall Shannon's theorem [8]. He defines the information content of a message in the following way: given a message made up of N characters in total and containing n different symbols, the information content of the message, measured in bits, is:

I = N \sum_{i=1}^{n} (-p_i \log_2 p_i)    (2.1)

where p_i is the occurrence probability of symbol i. What is regarded as a symbol depends on the application: it might be an ASCII code, 16- or 32-bit words, words in a text and so on.
A practical illustration of Shannon's theorem is the following: let us assume we measure a charge or any other physical quantity using an 8-bit digitizer. Very often the measured quantities will be distributed approximately exponentially. Let us assume that the mean value of the statistical distribution is one tenth of the dynamic range, i.e. 25.6. Each value between 0 and 255 is regarded as a symbol. Applying Shannon's formula with n = 256 and p_i = e^{-(i+0.5)/25.6}/25.6, we obtain a mean information content I/N of 6.11 bits per measured value, which is almost 25% less than the 8 bits we need when saving the data as a sequence of bytes. Even if we had increased the dynamic range by a factor of 4 using a 10-bit ADC, it turns out that the mean information content expressed as the number of bits per measurement would have been virtually the same, and hence the possible compression gain even higher (39%). This might be surprising, but considering that an exponential distribution delivers a value beyond ten times the mean only once every e^{10} ≈ 22026 samples, it is clear that even using a quite long code for such measurements cannot have an appreciable influence on the compression rates. Considering that in all likelihood, in a realistic architecture, we would have had to expand the 10 bits to 16, the gain is an impressive 62% in the latter case.
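The worked example above can be reproduced numerically from equation (2.1); a sketch in which I/N is computed directly from the assumed exponential distribution:

```python
import math

mean = 25.6   # mean of the exponential distribution, 1/10 of the 8-bit range

# Occurrence probability of each of the 256 symbols, as in the text
p = [math.exp(-(i + 0.5) / mean) / mean for i in range(256)]

# Mean information content per measured value, in bits (equation 2.1 with N = 1)
bits = sum(-pi * math.log2(pi) for pi in p)   # ~6.1 bits per value

saving_8bit = 1 - bits / 8     # ~25% saved with respect to 8-bit storage
saving_16bit = 1 - bits / 16   # ~62% saved with respect to 16-bit storage
```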
The exponential distribution is a good approximation of the raw data in many cases, and in particular of the data coming out of the SDD. Comparing various probability distributions with the same RMS, it appears that the exponential distribution is particularly hard to compress. For instance, a discrete spectrum distributed according to a Gaussian with the same RMS as the above exponential only has an information content of 4.75 bits.
2.3 Compression techniques

When we speak of a compression technique or a compression algorithm, we actually refer to two algorithms: the first one takes an input X and generates a representation X_C that requires fewer bits; the second one is a reconstruction algorithm that operates on the compressed representation X_C to generate the reconstruction Y. Based upon the requirements of reconstruction, data compression schemes can be divided into two broad classes:

– lossless compression schemes, in which Y is identical to X;

– lossy compression schemes, which generally provide much higher compression than lossless ones, but force Y to be different from X.

In fact, Shannon showed that the best performance achievable by a lossless compression algorithm is to encode a stream with an average number of bits equal to the I/N value. Lossy algorithms, on the contrary, have no upper bound on the compression ratio.
2.3.1 Lossless compression
Lossless compression techniques involve no loss of information. If data have been losslessly compressed, the original data can be recovered exactly from the compressed data. Lossless compression is generally used for discrete data, such as text, computer-generated data and some kinds of image and video information. There are many situations requiring compression where we want the reconstruction to be identical to the original. There are also a number of situations in which it is possible to relax this requirement in order to get more compression: in these cases lossy compression techniques have to be used.
2.3.2 Lossy compression

Lossy compression techniques involve some loss of information, and data that have been compressed using lossy techniques generally cannot be recovered or reconstructed exactly. In return for accepting distortion in the reconstruction, we can generally obtain much higher compression ratios than is possible with lossless compression. Whether the distortion introduced is acceptable or not depends on the specific application: for instance, if the input source X contains physical information plus noise, while the output Y contains only the physical signal, the distortion introduced is completely acceptable.
2.3.3 Measures of performance

A compression algorithm can be evaluated in a number of different ways. We could measure the relative complexity of the algorithm, the memory required to implement it, how fast it performs on a given machine or on dedicated hardware, the amount of compression and how closely the reconstruction resembles the original. The last two features are the most important ones for our application to SDD data.
A very logical way of measuring how well a compression algorithm compresses a given set of data is to look at the ratio of the number of bits required to represent the data before compression to the number of bits required to represent the data after compression. This ratio is called the compression ratio. Suppose we store an image made up of a square array of 256 × 256 8-bit pixels (exactly like a half-SDD): it requires 64 KBytes. If the compressed image requires only 16 KBytes, we would say that the compression ratio is 4.

Another way of reporting compression performance is to provide the average number of bits required to represent a single sample. This is generally referred to as the rate. For instance, for the same image described above, if the average number of bits per pixel in the compressed representation is 2, the rate is 2 bits/pixel.

In lossy compression the reconstruction differs from the original data. Therefore, in order to determine the efficiency of a compression algorithm, we have to find some way to quantify the difference. The difference between the original data and the reconstructed data is often called the distortion, and is usually calculated as a mathematical or percentage difference between the data before and after compression.
2.3.4 Modelling and coding

The development of data compression algorithms for a variety of data types can be divided into two phases. The first phase is usually referred to as modelling: in this phase we try to extract information about any redundancy that exists in the data and describe the redundancy in the form of a model. The second phase is called coding: the description of the model, and a description of how the data differ from the model, are encoded, generally using a binary alphabet.
2.4 Lossless compression techniques

This section contains an explanation of the most widely used lossless compression techniques. In particular, the following items are covered:

– Huffman coding;

– run length encoding;

– differential encoding;

– dictionary techniques;

– selective readout.

Some of these algorithms have been chosen for direct application in the 1D compression algorithm implemented in the prototypes CARLOS v1 and v2.
2.4.1 Huffman coding
The Huffman compression algorithm [7] encodes data samples as follows: symbols that occur more frequently (i.e. symbols having a higher probability of occurrence) are given shorter codewords than symbols that occur less frequently. This leads to a variable-length coding scheme, in which each symbol can be encoded with a different number of bits. The choice of the code assigned to each symbol, in other words the design of the Huffman look-up table, follows a standard procedure, which is best explained with an example.
Suppose we have 5 symbols a1, a2, a3, a4 and a5, each with a probability of occurrence: P(a1) = 0.2, P(a2) = 0.4, P(a3) = 0.2, P(a4) = 0.1, P(a5) = 0.1. First, in order to derive the encoding c(ai) of each symbol ai, the symbols must be ordered from the most probable to the least probable, as shown in Tab. 2.1.
Data Probability Code
a2 0.4 c(a2)
a1 0.2 c(a1)
a3 0.2 c(a3)
a4 0.1 c(a4)
a5 0.1 c(a5)
Table 2.1: Sample data and probability of occurrence
The least probable symbols are a4 and a5; they are assigned the following codes:

c(a4) = α1 ∗ 0    (2.2)
c(a5) = α1 ∗ 1    (2.3)

where α1 is a generic binary string and ∗ represents the concatenation of two strings.
If a′4 is a symbol for which P(a′4) = P(a4) + P(a5) = 0.2 holds, then the data in Tab. 2.1 can be reordered from the most probable to the least probable, as shown in Tab. 2.2.
Data Probability Code
a2 0.4 c(a2)
a1 0.2 c(a1)
a3 0.2 c(a3)
a′4 0.2 α1
Table 2.2: Introduction of data a′4
In this table the least probable symbols are a3 and a′4; they can be encoded in the following way:

c(a3) = α2 ∗ 0    (2.4)
c(a′4) = α2 ∗ 1    (2.5)

Nevertheless, being c(a′4) = α1 from Tab. 2.2, from (2.5) it follows
that α1 = α2 ∗ 1, and then (2.2) and (2.3) become:

c(a4) = α2 ∗ 10    (2.6)
c(a5) = α2 ∗ 11    (2.7)

Defining a′3 as the symbol for which P(a′3) = P(a3) + P(a′4) = 0.4, the data of Tab. 2.2 can be reordered from the most probable to the least probable as shown in Tab. 2.3.
Data Probability Code
a2 0.4 c(a2)
a′3 0.4 α2
a1 0.2 c(a1)
Table 2.3: Introduction of data a′3
In Tab. 2.3 the least probable symbols are a′3 and a1; they can be encoded in the following way:

c(a′3) = α3 ∗ 0    (2.8)
c(a1) = α3 ∗ 1    (2.9)

Being c(a′3) = α2 from Tab. 2.3, from (2.8) it follows that α2 = α3 ∗ 0, so (2.4), (2.6) and (2.7) become:
c(a3) = α3 ∗ 00    (2.10)
c(a4) = α3 ∗ 010    (2.11)
c(a5) = α3 ∗ 011    (2.12)

Finally, by defining a′′3 as the symbol for which P(a′′3) = P(a′3) + P(a1) = 0.6 holds, the data of Tab. 2.3 can be reordered from the most probable to the least probable as shown in Tab. 2.4.
Data Probability Code
a′′3 0.6 α3
a2 0.4 c(a2)
Table 2.4: Introduction of data a′′3
With only two symbols left, the encoding is immediate:

c(a′′3) = 0    (2.13)
c(a2) = 1    (2.14)

Moreover, being c(a′′3) = α3, as shown in Tab. 2.4, from (2.13) it follows that α3 = 0, i.e. (2.9), (2.10), (2.11) and (2.12) can be written as:

c(a1) = 01    (2.15)
c(a3) = 000    (2.16)
c(a4) = 0010    (2.17)
c(a5) = 0011    (2.18)
Tab. 2.5 contains a complete view of the Huffman table generated so far.
The method used for building the Huffman table in this example can be applied as it is to any data stream, whatever its statistical structure. The Huffman codes c(ai) generated in this way can be uniquely decoded: from a sequence of variable-length codes c(ai) created with Huffman coding, only one data sequence ai can be reconstructed.
Moreover, as shown in the example of Tab. 2.5, none of the codes c(ai) is contained as a prefix in the other codes; codes with this property are called prefix codes. Prefix codes are always uniquely decodable, while the converse does not always hold true.
Finally, a Huffman code is an optimal code: among all prefix codes, it is the one that minimizes the average code length.
Data Probability Code
a2 0.4 1
a1 0.2 01
a3 0.2 000
a4 0.1 0010
a5 0.1 0011
Table 2.5: Huffman table
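The construction just described can be sketched in a few lines of Python (an illustrative sketch, not the CARLOS implementation). Several equivalent optimal codes exist, differing only in how ties between equal probabilities are broken, so the checks below target the invariant quantities: the minimum average code length (2.2 bits/symbol for the data of Tab. 2.5) and the prefix property.

```python
import heapq

def huffman_code(probs):
    # Repeatedly merge the two least probable nodes, prefixing their
    # codewords with 0 and 1, as in the step-by-step example above.
    heap = [(p, i, {s: ""}) for i, (s, p) in enumerate(probs.items())]
    heapq.heapify(heap)
    tie = len(heap)  # tie-breaker so code dicts are never compared
    while len(heap) > 1:
        p1, _, c1 = heapq.heappop(heap)
        p2, _, c2 = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in c1.items()}
        merged.update({s: "1" + c for s, c in c2.items()})
        heapq.heappush(heap, (p1 + p2, tie, merged))
        tie += 1
    return heap[0][2]

probs = {"a1": 0.2, "a2": 0.4, "a3": 0.2, "a4": 0.1, "a5": 0.1}
codes = huffman_code(probs)
# Invariant of every optimal code for these probabilities: 2.2 bits/symbol.
avg = sum(probs[s] * len(codes[s]) for s in probs)
assert abs(avg - 2.2) < 1e-9
# Prefix property: no codeword is a prefix of another.
words = list(codes.values())
assert not any(u != v and v.startswith(u) for u in words for v in words)
```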
2.4.2 Run Length encoding
A data stream very often contains long sequences of the same value: this may happen when a physical quantity holds the same value for several sampling periods, in text files where a character is repeated several times, in digital images where areas of uniform color are encoded as pixels with the same value, and so on. The compression algorithm based on Run Length encoding [9] is well suited to such repetitive data.
As shown in the example of Fig. 2.1, where the zero symbol has been chosen as the repetitive value, each zero run in the original sequence is encoded as a pair of words: the first contains the code for the zero symbol, the second contains the number of further zeros following the first one, i.e. the run length minus one (so a run of three zeros becomes the pair 0 2).
The performance of the algorithm, in terms of compression ratio, improves when the input data stream contains long runs of the repeated symbol and few isolated occurrences of it: an isolated zero, such as the one encoded as the pair 0 0 in Fig. 2.1, actually doubles in size. Finally, this compression algorithm can be implemented in different ways: it can be applied to only one value of the original data alphabet or to several different values.
One of the most important applications of Run Length encoding is facsimile (fax) compression. In facsimile transmission a page is scanned and converted into a sequence of white and black pixels: since very long runs of white or black pixels are highly probable, coding the lengths of the runs instead of the individual pixels
Original sequence: 17 8 54 0 0 0 97 5 16 0 45 23 0 0 0 0 43
Run Length encoded sequence: 17 8 54 0 2 97 5 16 0 0 45 23 0 3 43
Figure 2.1: Run length encoding
leads to high compression ratios. Moreover, Run Length encoding is often used in conjunction with other compression algorithms, after the input data stream has been transformed into a more compressible form.
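The encoding of Fig. 2.1 can be reproduced with a short illustrative sketch, adopting the convention, taken from the figure, that the stored count is the run length minus one:

```python
def rle_encode(seq, symbol=0):
    # Encode runs of `symbol` as the pair (symbol, run_length - 1),
    # as in Fig. 2.1; all other values pass through unchanged.
    out, i = [], 0
    while i < len(seq):
        if seq[i] == symbol:
            run = 1
            while i + run < len(seq) and seq[i + run] == symbol:
                run += 1
            out += [symbol, run - 1]
            i += run
        else:
            out.append(seq[i])
            i += 1
    return out

original = [17, 8, 54, 0, 0, 0, 97, 5, 16, 0, 45, 23, 0, 0, 0, 0, 43]
assert rle_encode(original) == [17, 8, 54, 0, 2, 97, 5, 16, 0, 0, 45, 23, 0, 3, 43]
```

Note how the isolated zero becomes the two-word pair 0 0, which is why isolated occurrences of the repeated symbol hurt the compression ratio.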
2.4.3 Differential encoding
Differential encoding [7] replaces each sample with the difference between it and the previous one; the first sample is left unchanged, as shown in Fig. 2.2.
Notice that each value of the original sequence can be reconstructed by summing the corresponding value in the coded sequence with all the previous ones: for instance, 89 = 79 + 17 + 2 + 5 + 0 + 0 + (−3) + (−6) + (−5). It is therefore essential to leave the first value of the coded sequence unchanged, otherwise the reconstruction cannot be carried out correctly. The differential algorithm is well suited to all data sequences with very small changes in value between consecutive samples: for such data streams the differential encoding produces an encoded stream with a smaller dynamic range, i.e. the difference between the maximum and minimum values of the encoded stream is smaller than the same quantity calculated on the original sequence. The encoded sequence can therefore be represented with a smaller number of bits than the original one.
Original sequence: 17 19 24 24 24 21 15 10 89 95 96 96 96 95 94 94 95 ...
Sequence after differential encoding: 17 2 5 0 0 −3 −6 −5 79 6 1 0 0 −1 −1 0 1
Figure 2.2: Differential encoding
Moreover, differential encoding can be used in conjunction with Run Length encoding: a sequence containing long runs of equal values is converted into a sequence of zeros by the differential encoder, and can then be further compressed by the Run Length encoder.
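A minimal sketch of the encoder and of the reconstruction by cumulative sums, checked against the sequences of Fig. 2.2:

```python
def diff_encode(seq):
    # First sample unchanged, then successive differences (Fig. 2.2).
    return [seq[0]] + [seq[i] - seq[i - 1] for i in range(1, len(seq))]

def diff_decode(enc):
    # Reconstruct each sample as the running sum of the differences.
    out = [enc[0]]
    for d in enc[1:]:
        out.append(out[-1] + d)
    return out

original = [17, 19, 24, 24, 24, 21, 15, 10, 89, 95, 96, 96, 96, 95, 94, 94, 95]
encoded = diff_encode(original)
assert encoded == [17, 2, 5, 0, 0, -3, -6, -5, 79, 6, 1, 0, 0, -1, -1, 0, 1]
assert diff_decode(encoded) == original
```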
2.4.4 Dictionary techniques
In many applications, the output of a source consists of recurring patterns. A classical example is a text source in which certain patterns or words recur frequently, while other patterns simply do not occur or occur only rarely. A very reasonable approach to encoding such sources is to keep a list, or dictionary, of frequently occurring patterns. When these patterns appear in the source, they are encoded with a reference to the dictionary, i.e. the address of the corresponding table location. If a pattern does not appear in the dictionary, it can be encoded using some other, less efficient, method. In effect the input domain is split into two classes: frequently occurring patterns and infrequently occurring patterns. For this technique to be effective, the class of frequently occurring patterns, and hence the size of the dictionary, must be much smaller than the number of all possible patterns. Depending upon how much information is available to build the dictionary, either a static or a dynamic approach to its creation can be used. Choosing a static dictionary technique is most appropriate when considerable prior knowledge about the source is available.
When no a priori information on the structure of the input source is available, an adaptive technique is adopted: the UNIX compress command, for example, makes use of this technique. It starts with a dictionary of size 512, thus transmitting 9-bit codewords. Once the dictionary has filled up, its size is doubled to 1024 entries, and 10-bit codewords are transmitted. The dictionary size is progressively doubled as it fills up until it contains 2^16 entries, at which point compress becomes a static coding technique. From then on the algorithm monitors the compression ratio: if it falls below a threshold, the dictionary is flushed and the dictionary building process is restarted.
Dictionary techniques are also used in the image compression field by the GIF (Graphics Interchange Format) standard, which works in a very similar way to the compress command.
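The core of the adaptive dictionary scheme (LZW, the algorithm underlying compress and GIF) can be sketched as follows; this illustrative version keeps only the essential idea of growing the dictionary as patterns are seen, omitting the variable codeword width and the dictionary flushing described above:

```python
def lzw_encode(data: bytes):
    # Dictionary initialized with all single-byte patterns.
    table = {bytes([i]): i for i in range(256)}
    w, out = b"", []
    for byte in data:
        wc = w + bytes([byte])
        if wc in table:
            w = wc
        else:
            out.append(table[w])
            table[wc] = len(table)   # adaptive step: grow the dictionary
            w = bytes([byte])
    if w:
        out.append(table[w])
    return out

def lzw_decode(codes):
    table = {i: bytes([i]) for i in range(256)}
    w = table[codes[0]]
    out = bytearray(w)
    for code in codes[1:]:
        # Special case: the code may refer to the entry being built.
        entry = table[code] if code in table else w + w[:1]
        out += entry
        table[len(table)] = w + entry[:1]
        w = entry
    return bytes(out)

data = b"ababababab"
codes = lzw_encode(data)
assert lzw_decode(codes) == data
assert len(codes) < len(data)   # recurring patterns are compressed
```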
2.4.5 Selective readout
The selective readout technique [10] is a lossless data compression technique usually applied in High Energy Physics experiments. Since the really interesting data are a small fraction of the total amount of data actually produced, it proves useful to transmit and store only those data. Selective readout can reduce the data size by identifying regions in space containing a significant amount of energy. For example, in the SDD case the Central Trigger Processor (CTP) unit defines a Region Of Interest (ROI) which, event by event, contains the information about which ladders are to be read out and which ones can be discarded. Using the ROI feature a very high compression ratio can be achieved.
2.5 Lossy compression techniques
This section describes the most widely used lossy compression techniques. In particular the following items are covered:
– zero suppression;
– transform coding;
– sub-band coding, with some remarks on wavelets.
The first of these algorithms has been chosen for direct application in the 1D compression algorithm implemented in the prototypes CARLOS v1 and v2.
2.5.1 Zero suppression
Zero suppression is the very simple technique of eliminating the data samples below a certain threshold, by setting them to 0. It proves very useful for data containing large quantities of zeros with the interesting values concentrated in small clusters: for instance, since the mean occupancy of an SDD in the inner layer is 2.5%, a compression ratio of 40 can be obtained with the zero suppression technique alone.
A complication arises because the SDD data and, in general, data collections contain the sum of two different distributions: the real signal corresponding to the interesting physical event, and white noise with a Gaussian distribution around a mean value. Hence, if a lossy compression algorithm obtains a good compression ratio just by eliminating the noise, the distortion introduced is perfectly acceptable. The key task for a sound implementation of the zero suppression technique is the choice of the right value of the threshold parameter, in order to eliminate the noise while preserving the physical signal.
In the case of data coming from the SDD detector and the related front-end electronics, the data values are shifted from the 0 level to a baseline level greater than 0. This baseline level corresponds to the mean value of the noise introduced by the preamplification electronics; around this value there is a spread given by the RMS of the Gaussian distribution of the noise.
The noise level introduced by the electronics may vary with time and with the amount of radiation absorbed: a compression algorithm making use of the zero suppression technique therefore has to allow a tunable threshold level, in order to accommodate fluctuations or drifts in the baseline values. Following this indication, the threshold level used in CARLOS v1 and v2 is completely presettable via software using the JTAG port.
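A minimal sketch of the operation (the baseline and threshold values below are purely illustrative, not the actual CARLOS settings):

```python
def zero_suppress(samples, baseline, threshold):
    # Keep a sample only if it exceeds the (presettable) threshold
    # above the baseline; everything else is set to 0.
    return [s if s - baseline > threshold else 0 for s in samples]

# baseline ~ mean noise level; threshold tuned above the noise RMS
noisy = [21, 20, 23, 58, 60, 22, 19]
assert zero_suppress(noisy, baseline=20, threshold=5) == [0, 0, 0, 58, 60, 0, 0]
```

Only the cluster of samples well above the baseline survives; the noise fluctuations around it are suppressed.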
2.5.2 Transform coding
Transform coding [7] takes a data sequence as input and transforms it into a sequence in which most of the information is concentrated in a few samples: the new sequence can then be further compressed using the compression algorithms described so far. The key point of transform coding is the choice of the transform, which depends on the features and redundancies of the input data stream to compress. The algorithm, working on N elements at a time, consists of three steps:
– transform: the input sequence {sn} is split into blocks of N samples; each block is then mapped, using a reversible transformation, into the sequence {cn};
– quantization: the transformed sequence {cn} is quantized, i.e. a number of bits is assigned to each sample depending on the dynamic range of the sequence, the desired compression ratio and the acceptable distortion;
– coding: the quantized sequence {cn} is encoded using a binary encoding technique such as Run Length encoding or Huffman coding.
These concepts can be expressed in a mathematical way: the input sequence {sn} is divided into blocks of N samples and each block is mapped
using the reversible transform A into the sequence {cn}:

c = A s    (2.19)

or, in other terms:

c_n = Σ_{i=0}^{N−1} s_i a_{n,i},  with [A]_{i,j} = a_{i,j}    (2.20)

The quantization and encoding steps are performed on the sequence {cn}, so as to optimize compression.
The decompression algorithm, by means of the inverse transform B = A^{−1}, reconstructs the original sequence {sn} from the encoded sequence {cn} in the following way:

s = B c    (2.21)

or:

s_n = Σ_{i=0}^{N−1} c_i b_{n,i},  with [B]_{i,j} = b_{i,j}    (2.22)
These concepts can be easily extended to two-dimensional data, such as images or 2-D charge distributions, as in the case of the SDDs.
Let us take an N × N portion of a digital image S, with S_{i,j} its (i, j)-th pixel; performing a reversible two-dimensional transform working on N × N pixels at a time, with kernel elements a_{i,j,k,l} and with C_{k,l} the (k, l)-th coefficient of the N × N block of the transformed image C, the following holds true:

C_{k,l} = Σ_{i=0}^{N−1} Σ_{j=0}^{N−1} S_{i,j} a_{i,j,k,l}    (2.23)
A transform is defined separable if the 2D transform of an N × N block can be applied by first performing a 1D transform on the N rows of the block and then a 1D transform on the N columns of the result; by choosing a separable transform, (2.23) becomes:

C_{k,l} = Σ_{i=0}^{N−1} Σ_{j=0}^{N−1} S_{i,j} a_{k,i} a_{l,j}    (2.24)
or, expressed in matrix form:

C = A S A^T    (2.25)

The inverse transform is the following one:

S = B C B^T    (2.26)

Frequently orthonormal transforms are used, so that B = A^{−1} = A^T; in this way calculating the inverse transform reduces to:

S = A^T C A    (2.27)
In the two-dimensional case as well, a good transform has to be chosen in order to reach a high compression ratio. For instance, until the year 2000 the JPEG standard adopted the Discrete Cosine Transform, known as DCT.
If A is the matrix representing the DCT, its elements are given by:

[A]_{i,j} = w(i) cos( (2j + 1) i π / (2N) ),  i, j = 0, 1, …, N − 1    (2.28)

where:

w(i) = √(1/N)  for i = 0
w(i) = √(2/N)  for i = 1, …, N − 1

Fig. 2.3 gives a graphical interpretation of (2.28).
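Definition (2.28) can be verified numerically: the sketch below builds the DCT matrix and checks the orthonormality A A^T = I that justifies the simple inverse transform of (2.27).

```python
import math

def dct_matrix(N):
    # Build the N x N DCT matrix of Eq. (2.28):
    # [A]_{i,j} = w(i) cos((2j + 1) i pi / (2N))
    A = [[0.0] * N for _ in range(N)]
    for i in range(N):
        w = math.sqrt(1.0 / N) if i == 0 else math.sqrt(2.0 / N)
        for j in range(N):
            A[i][j] = w * math.cos((2 * j + 1) * i * math.pi / (2 * N))
    return A

A = dct_matrix(8)
# Orthonormality check: A * A^T must be the identity matrix, so the
# inverse transform is simply the transpose, as used in (2.27).
for i in range(8):
    for k in range(8):
        dot = sum(A[i][j] * A[k][j] for j in range(8))
        assert abs(dot - (1.0 if i == k else 0.0)) < 1e-9
```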
After choosing the transform, the next step consists in the quantization of the transformed image.
Several approaches are possible: for example, zonal mapping foresees a preliminary analysis of the statistics of the transformed coefficients and a later assignment of a fixed number of bits.
The name zonal mapping comes from the assignment of a fixed number of bits depending on the zone in which each coefficient is placed within the square N × N block under study; Tab. 2.6 reports a bit allocation
Figure 2.3: Base coefficients for the bi-dimensional DCT in the case N = 8
8 7 5 3 1 1 0 0
7 5 3 2 1 0 0 0
4 3 2 1 1 0 0 0
3 3 2 1 1 0 0 0
2 1 1 1 0 0 0 0
1 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0
Table 2.6: Bit allocation table for an 8 × 8 block
table for an 8 × 8 block.
It is interesting to note that the quantization of Tab. 2.6 assigns zero bits to the coefficients in the lower-right part of the table: this is actually equivalent to ignoring those coefficients. This kind of quantization makes sense because the lower-right coefficients come from the transformation of the original image with high-frequency cosines, i.e. they contain the information corresponding to the high frequencies of the original signal (see Fig. 2.3).
Since the response of the human eye strongly depends on frequency and, in particular, is sensitive to variations at low frequencies and far less sensitive at higher frequencies, the quantization of Tab. 2.6 tends to ignore information that the human eye would not appreciate at all.
After quantization, only the non-null coefficients are transmitted. In particular, for every non-null coefficient two words have to be transmitted: the first with the quantized value of the coefficient itself, the second with the number of null samples occurring since the last non-null coefficient. This allows the decompression algorithm to exactly reconstruct the sequence as it was quantized and, from that, the original image.
As an example, let us consider the 8 × 8 block of 8-bit pixels reported in Tab. 2.7.
124 125 122 120 122 119 117 118
121 121 120 119 119 120 120 118
126 124 123 122 121 121 120 120
124 124 125 125 126 125 124 124
127 127 128 129 130 128 127 125
143 142 143 142 140 139 139 139
150 148 152 152 152 152 150 151
156 159 158 155 158 158 157 156
Table 2.7: 8 × 8 block of a digital image
Each value of the block is shifted by 2^(p−1), where p is the number of bits per pixel (in this case p = 8); then the DCT is applied to the block, obtaining the coefficients c_{i,j} reported in Tab. 2.8.
39.88 6.56 -2.24 1.22 -0.37 -1.08 0.79 1.13
-102.43 4.56 2.26 1.12 0.35 -0.63 -1.05 -0.48
37.77 1.31 1.77 0.25 -1.50 -2.21 -0.10 0.23
-5.67 2.24 -1.32 -0.81 1.41 0.22 -0.13 0.17
-3.37 -0.74 -1.75 0.77 -0.62 -2.65 -1.30 0.76
5.98 -0.13 -0.45 -0.77 1.99 -0.26 1.46 0.00
3.97 5.52 2.39 -0.55 -0.051 -0.84 -0.52 -0.13
-3.43 0.51 -1.07 0.87 0.96 0.09 0.33 0.01
Table 2.8: DCT coefficients related to the block in Tab. 2.7
As already stated, the high-frequency coefficients in the lower-right corner tend to be quite close to 0, while most of the information is concentrated in the upper-left corner.
The quantization of the coefficients is performed using a reference table such as Tab. 2.9; in particular the quantized values l_{i,j} are obtained with the following formula:

l_{i,j} = ⌊ c_{i,j} / Q_{i,j} + 0.5 ⌋    (2.29)

where Q_{i,j} is the (i, j)-th element of the quantization table and ⌊x⌋ denotes the greatest integer not exceeding x.
16 11 10 16 24 40 51 61
12 12 14 19 26 58 60 55
14 13 16 24 40 57 69 56
14 17 22 29 51 87 80 62
18 22 37 56 68 109 103 77
24 35 55 64 81 104 113 92
49 64 78 87 103 121 120 101
72 92 95 98 112 100 103 99
Table 2.9: Quantization table
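Formula (2.29) can be checked on a few entries of Tab. 2.8 (coefficients) and Tab. 2.9 (quantizers); the quantized values obtained below match the corresponding nonzero entries of Tab. 2.10.

```python
import math

def quantize(c, q):
    # Eq. (2.29): l_ij = floor(c_ij / Q_ij + 0.5), i.e. rounding to the
    # nearest integer (ties rounded up).
    return math.floor(c / q + 0.5)

# First column of Tab. 2.8 against the first column of Tab. 2.9:
assert quantize(39.88, 16) == 2
assert quantize(-102.43, 12) == -9
assert quantize(37.77, 14) == 3
assert quantize(6.56, 11) == 1
```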
Tab. 2.10 contains the quantized coefficients obtained using the values of the quantization table Tab. 2.9.
After studying the structure of matrices like Tab. 2.10, the order chosen for sending the coefficients is the one shown in Fig. 2.4. This choice makes it highly probable that the final part of the sequence contains a long run of zero coefficients, so this part of the sequence can be encoded using the Run Length technique.
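The zig-zag order of Fig. 2.4 can be generated by walking the anti-diagonals of the block in alternating directions (an illustrative sketch):

```python
def zigzag_order(N=8):
    # Enumerate (row, col) indices diagonal by diagonal, alternating
    # direction, as in the JPEG zig-zag scan of an N x N block.
    order = []
    for d in range(2 * N - 1):
        cells = [(i, d - i) for i in range(N) if 0 <= d - i < N]
        # odd diagonals run top-right -> bottom-left, even ones the reverse
        order.extend(cells if d % 2 else reversed(cells))
    return order

order = zigzag_order(8)
assert order[:6] == [(0, 0), (0, 1), (1, 0), (2, 0), (1, 1), (0, 2)]
assert len(order) == 64 and order[-1] == (7, 7)
```

Reading Tab. 2.10 in this order gives 2, 1, −9, 3 followed by 60 zeros, which is exactly the long zero run that Run Length encoding exploits.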
2 1 0 0 0 0 0 0
-9 0 0 0 0 0 0 0
3 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0
Table 2.10: Quantized coefficients
Figure 2.4: Zig-zag scanning pattern for an 8 × 8 transform
2.5.3 Subband coding
A signal can be decomposed into different frequency components (see Fig. 2.5) using analog or digital filters; each resulting signal can then
be encoded and compressed using a specific algorithm. Digital filtering [9] involves taking a weighted sum of the current and past inputs to the filter and, in some cases, of the past outputs of the filter. The general form of the input-output relationship of the filter is given by:

y_n = Σ_{i=0}^{N} a_i x_{n−i} + Σ_{i=1}^{M} b_i y_{n−i}    (2.30)

where the sequence x_n is the input to the filter, the sequence y_n is the output from the filter, and the values a_i and b_i are called the filter coefficients. If the input sequence is a single 1 followed by all 0s, the output sequence is called the impulse response of the filter. The impulse response completely specifies the filter: once we know the impulse response, we know the relationship between the input and the output of the filter. Notice that if the b_i are all zero, the impulse response dies out after N samples. These filters are called finite impulse response or FIR filters; for them Eq. 2.30 reduces to a convolution between the input signal and the filter coefficients. Filters with nonzero values for some of the b_i are called infinite impulse response or IIR filters.
Figure 2.5: Decomposition of a signal into frequency components
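Eq. 2.30 translates directly into code; in this illustrative sketch an empty b gives a FIR filter, whose impulse response indeed reproduces its coefficients and then dies out:

```python
def filter_signal(x, a, b=()):
    # Direct implementation of Eq. (2.30):
    # y_n = sum_i a_i x_{n-i} + sum_i b_i y_{n-i}.
    # An empty b gives a FIR filter; nonzero b coefficients give an IIR one.
    y = []
    for n in range(len(x)):
        acc = sum(a[i] * x[n - i] for i in range(len(a)) if n - i >= 0)
        acc += sum(b[i - 1] * y[n - i] for i in range(1, len(b) + 1) if n - i >= 0)
        y.append(acc)
    return y

# FIR: the impulse response is the coefficient sequence itself.
assert filter_signal([1, 0, 0, 0, 0], [1, 2, 3]) == [1, 2, 3, 0, 0]
# IIR: a single feedback coefficient gives an infinitely decaying response.
assert filter_signal([1, 0, 0, 0], [1], [0.5]) == [1, 0.5, 0.25, 0.125]
```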
The basic subband coding scheme works as follows: the source is passed through a bank of filters (a 3-level filter bank is shown in Fig. 2.6), called the analysis filter bank, which covers the range of frequencies that make up the source; the outputs of the filters are then subsampled as in Fig. 2.7. The justification for subsampling is the Nyquist rule and its generalization, which tells us that for perfect reconstruction we only need twice as many samples per second as the range of frequencies. This means that it is possible to reduce the number of samples at the output of a filter when its output frequency range is smaller than its input frequency range. The process of reducing the number of samples is called decimation or downsampling. The amount of decimation depends on the ratio of the bandwidth of the filter output
Figure 2.6: An 8-band 3-level filter bank
to that of the filter input. If the bandwidth at the output of the filter is 1/M of the bandwidth at its input, the output is decimated by a factor M by keeping every M-th sample. Once the output of the filters has been decimated, it is encoded using one of the encoding schemes described so far.
Along with the selection of the compression scheme, the allocation of bits among the subbands is an important design parameter, since different subbands contain different amounts of information. The bit allocation procedure can have a significant impact on the quality of the final reconstruction, especially when the information content of the different bands differs widely.
The decompression phase, in subband coding also called synthesis, works as follows: first the encoded samples for each subband are decoded at the receiver; then the decoded values are upsampled by inserting an appropriate number of zeros between the samples; finally, the upsampled signals are passed through a bank of reconstruction filters and added together.
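The synthesis chain just described (decode, zero-stuff, filter, add) can be sketched as follows; the two-tap filter pair used in the usage example is the Haar one, chosen only for illustration — the actual reconstruction filters depend on the analysis bank:

```python
def upsample(samples, m):
    """Insert m - 1 zeros after each sample (zero stuffing)."""
    out = []
    for s in samples:
        out.append(s)
        out.extend([0] * (m - 1))
    return out

def fir(signal, taps):
    """Causal FIR filter: y[n] = sum_k taps[k] * x[n - k]."""
    padded = [0] * (len(taps) - 1) + signal
    return [sum(t * padded[n + len(taps) - 1 - k]
                for k, t in enumerate(taps))
            for n in range(len(signal))]

def synthesis(low, high, g_low, g_high):
    """Upsample each decoded subband, filter it and add the results."""
    lo = fir(upsample(low, 2), g_low)
    hi = fir(upsample(high, 2), g_high)
    return [a + b for a, b in zip(lo, hi)]

# With the Haar analysis averages low = [3, 7] and differences
# high = [1, -1] of the input [4, 2, 6, 8], the Haar synthesis
# pair recovers the original samples:
print(synthesis([3, 7], [1, -1], [1, 1], [1, -1]))  # -> [4, 2, 6, 8]
```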
Figure 2.7: Subband coding technique: analysis filter bank, downsampling and encoding
Subband coding has applications in speech and audio coding (e.g. MPEG audio), but it can also be applied to image compression.
2.5.4 Wavelets

Another method of decomposing signals that has gained a great deal of popularity in recent years is the use of wavelets [11, 12, 13, 14]. Decomposing a signal in terms of its frequency content using sinusoids results in a very fine resolution in the frequency domain. However, sinusoids are defined on the time domain from −∞ to ∞, therefore individual frequency components give no temporal resolution [15]. In a wavelet representation, a signal is represented in terms of functions that are localized both in time and in frequency. For instance, the following is known as the Haar wavelet:
\[
\psi_{0,0}(x) = \begin{cases} 1 & 0 \le x < \tfrac{1}{2} \\ -1 & \tfrac{1}{2} \le x < 1 \\ 0 & \text{otherwise} \end{cases}
\]
Figure 2.8: The Haar wavelet functions ψ0,0, ψ1,0, ψ1,1, ψ2,0, ψ2,1 and ψ2,2
From this “mother” function the following set of functions can be obtained:
\[
\psi_{j,k}(x) = \psi_{0,0}(2^j x - k) = \begin{cases} 1 & k2^{-j} \le x < (k+\tfrac{1}{2})2^{-j} \\ -1 & (k+\tfrac{1}{2})2^{-j} \le x < (k+1)2^{-j} \\ 0 & \text{otherwise} \end{cases}
\]
Figure 2.9: Example of multiresolution analysis (panels a–d)
In 1989, Stephane Mallat ([16]) developed the multiresolution approach, which moved the wavelet representation into the domain of subband coding. These concepts can be better understood with the help of an example. Let us suppose we have to approximate the function f(t) drawn in Fig. 2.9a using translated versions of some time-limited function φ(t). A simple approximating function is the indicator function:
\[
\phi_{0,0}(t) = \begin{cases} 1 & 0 \le t < 1 \\ 0 & \text{otherwise} \end{cases} \qquad (2.36)
\]
The function f(t) can then be approximated as \(\phi^0_f(t) = \sum_k c_{0,k}\,\phi_{0,k}(t)\), where the coefficients c_{0,k} are the average values of the function in the interval [k, k+1). In other words:
\[
c_{0,k} = \int_k^{k+1} f(t)\,\phi_{0,k}(t)\,dt \qquad (2.37)
\]
It is possible to scale φ(t) to obtain:
\[
\phi_{1,0}(t) = \phi_{0,0}(2t) = \begin{cases} 1 & 0 \le t < \tfrac{1}{2} \\ 0 & \text{otherwise} \end{cases} \qquad (2.38)
\]
Its translates would be given by:
\[
\phi_{1,k}(t) = \phi_{1,0}(t - k) \qquad (2.39)
\]
\[
\phantom{\phi_{1,k}(t)} = \phi_{0,0}(2t - k) = \begin{cases} 1 & 0 \le 2t - k < 1 \\ 0 & \text{otherwise} \end{cases} \qquad (2.40)
\]
is accurately represented by \(\phi^1_f(t)\). \(\phi^1_f(t)\) can be decomposed into a lower resolution version of itself, namely \(\phi^0_f(t)\), and the difference \(\phi^1_f(t) - \phi^0_f(t)\). Let us examine this difference over an arbitrary interval [k, k+1):
\[
\phi^1_f(t) - \phi^0_f(t) = \begin{cases} c_{1,2k} - c_{0,k} & k \le t < k+\tfrac{1}{2} \\ c_{1,2k+1} - c_{0,k} & k+\tfrac{1}{2} \le t < k+1 \end{cases} \qquad (2.41)
\]
2. If a function can be expressed exactly by a linear combination of the set {φj,k(t)}, then it can also be expressed exactly as a function of the set {φl,k(t)} for all l ≥ j.
3. The complete set \(\{\phi_{j,k}(t)\}_{j,k=-\infty}^{\infty}\) can exactly represent all functions with the property that:
\[
\int_{-\infty}^{\infty} |f(t)|^2\,dt < \infty \qquad (2.52)
\]
4. If a function f(t) can be exactly represented by the set {φ0,k(t)}, then any integer translate of the function, f(t − k), can also be represented exactly by the same set.
5.
\[
\int \phi_{0,l}(t)\,\phi_{0,k}(t)\,dt = \begin{cases} 0 & l \ne k \\ 1 & l = k \end{cases} \qquad (2.53)
\]
A set with these properties forms a multiresolution analysis [16]. Thus, at any resolution \(2^{-j}\), every function f(t) can be decomposed into two components: one that can be expressed as a linear combination of the set {φj,k(t)} and one that can be expressed as a linear combination of the wavelets {ψj,k(t)}.
The mother wavelet ψ0,0(t) and the scaling function φ0,0(t) are related in the following manner: from Property 2, φ0,0 can be written in terms of the φ1,k. If the relationship is given by:
\[
\phi_{0,0}(t) = \sum_n h_n\,\phi_{1,n}(t) \qquad (2.54)
\]
then the wavelet ψ0,0(t) is given by:
\[
\psi_{0,0}(t) = \sum_n (-1)^n h_n\,\phi_{1,n}(t) \qquad (2.55)
\]
From this relationship it follows that the wavelet decomposition can be implemented in terms of filters with impulse responses given by (2.54) and (2.55), and that these filters are quadrature mirror filters.
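The alternating-sign rule relating the two impulse responses is simple enough to sketch directly (function name ours; the Haar coefficients in the example are given up to normalization):

```python
def highpass_from_lowpass(h):
    """Build the wavelet (high-pass) coefficients from the scaling
    (low-pass) coefficients via the alternating-sign rule
    g[n] = (-1)**n * h[n] of equations (2.54)-(2.55)."""
    return [((-1) ** n) * c for n, c in enumerate(h)]

# For the Haar scaling coefficients this yields the Haar wavelet
# coefficients:
print(highpass_from_lowpass([0.5, 0.5]))  # -> [0.5, -0.5]
```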
Most orthonormal wavelets are nonzero over an infinite interval, therefore the corresponding filters are IIR filters. Well-known exceptions are the Daubechies wavelets, which correspond to FIR filters. Once the coefficients of the FIR filters have been obtained, the procedure for compression using wavelets is identical to the one described for subband coding. From now on the terms multiresolution analysis and wavelet-based analysis will be regarded as synonymous. Some of the most widely used wavelet families are shown in Fig. 2.10, Fig. 2.11 and Fig. 2.12.
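One level of this filter-and-decimate analysis can be sketched as below; the Haar filter pair used in the example (given up to normalization) is only an illustrative choice among the families of Fig. 2.10–2.12:

```python
def fir(signal, taps):
    """Causal FIR filter: y[n] = sum_k taps[k] * x[n - k]."""
    padded = [0] * (len(taps) - 1) + signal
    return [sum(t * padded[n + len(taps) - 1 - k]
                for k, t in enumerate(taps))
            for n in range(len(signal))]

def analyze(signal, h, g):
    """One level of wavelet analysis done as subband coding:
    low-pass with h, high-pass with g, then decimate each
    filter output by 2."""
    low = fir(signal, h)[1::2]
    high = fir(signal, g)[1::2]
    return low, high

# Haar pair: pairwise averages (low band) and half-differences (high band).
print(analyze([4, 2, 6, 8], [0.5, 0.5], [0.5, -0.5]))
# -> ([3.0, 7.0], [-1.0, 1.0])
```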
2.6 Implementation of compression algorithms
Compression algorithms can be implemented in hardware or in software, depending on the required speed. When speed is the most important constraint, a hardware implementation becomes necessary. Commercial devices implementing data compression in hardware exist: for example, the ALDC1-40S-M from IBM, featuring adaptive lossless data compression, works at a rate of 40 MBytes/s, while the AHA32321 chip from AHA can compress and decompress data at 10 MBytes/s with a clock frequency of 40 MHz. These rates are far smaller than the one required by the SDD readout: the compression chip we need has to face an input data rate of 320 MBytes/s. No commercial chip exists with such features, so we had to design an Application Specific Integrated Circuit (ASIC) targeted to our requirements.
Figure 2.10: Some functions belonging to different wavelet families (Haar; Daubechies db1, db2, db3, db10), with their wavelet function psi, scaling function phi and decomposition/reconstruction filters: note that db1 is equivalent to the Haar
Figure 2.11: Some functions belonging to different wavelet families (Symlets sym2, sym3, sym4, sym8; Coiflets coif1, coif2, coif3, coif5), with their wavelet function psi, scaling function phi and decomposition/reconstruction filters
Figure 2.12: Some functions belonging to different wavelet families (biorthogonal bior1.1, bior1.3, bior1.5, bior6.8; reverse biorthogonal rbio1.1, rbio1.3, rbio1.5, rbio6.8): note that bior1.1 and rbio1.1 are equivalent to the Haar
Chapter 3

1D compression algorithm and implementations
3.1 Compression algorithms for SDD

The choice of the algorithm for SDD data compression is strictly related to the features of the input data stream:

– low detector occupancy (max 3%)

– small samples are much more probable than high samples
The first feature suggests the use of a zero suppression algorithm: all samples below a certain value (depending on the noise distribution) are discarded. The second feature suggests adopting an entropy coder, such as the Huffman one. Besides that, it is important for the algorithm to contain software-tunable parameters in order to re-optimize its performance in case of changes in the statistics of the input distribution. For instance, the threshold level has to be changeable via software in order to take into account changes in the signal-to-noise ratio over the years, so the Huffman tables have to be reconfigurable too. The other important features for the compression algorithms are:

– they have to be fast
– they have to be simple to implement in hardware

– they have to allow lossless data transmission

For the development of the compression algorithms, studies have been performed on the statistical distribution of the sample data coming from single-particle events of three beam tests, so that noise could be properly taken into account. The compression results have been evaluated in order to verify the algorithm's efficiency and to find the best parameter values.
3.2 1D <strong>compression</strong> algorithm<br />
Following these requirements the <strong>INFN</strong> Section <strong>of</strong> Torino has chosen<br />
a sequential <strong>compression</strong> algorithm [17] which scans <strong>data</strong> coming from<br />
each anode row as uni-dimensional <strong>data</strong> streams. As shown in Fig. 3.1<br />
as an example, <strong>data</strong> samples coming from anode 76 are processed, then<br />
from anode 77 and so on. The ultimate goal <strong>of</strong> the algorithm is to<br />
save <strong>data</strong> belonging to a cluster, while rejecting all the other samples<br />
regarded as noise. To have a <strong>data</strong> reduction system that is applicable<br />
to all the situations, the algorithm is provided with different tuning<br />
parameters (Fig. 3.2 provides a graphical explanation <strong>of</strong> them):<br />
– threshold: the threshold parameter is applied to the incoming samples, forcing them to zero if they are smaller than this value. This parameter has the goal of eliminating noise and pedestals affecting the data.
– tolerance: the tolerance parameter is applied to the differences calculated between consecutive samples, forcing them to zero if they are less than this value (with this mechanism, samples that differ only slightly are considered equal). In this way, non-significant fluctuations of the input values are eliminated.
– disable: the disable parameter is applied to the input data, removing all previous mechanisms for samples greater than its value, in order to have full information on the clusters and to maintain good double-peak resolution. This means that the important information is not affected by the lossy compression algorithm.

Figure 3.1: Cluster in two dimensions and its slices along the anode direction
The 1D algorithm actually consists of 5 processing steps sequentially applied (see Fig. 3.3):

– first, the input data values below the threshold parameter value are put to 0;

– then, the difference between a sample and the previous one (along the time direction) is calculated;

– if the difference value is smaller than the tolerance parameter and the input sample is smaller than the disable parameter, the difference value is put to 0, otherwise its value is left unchanged;

– these values are then encoded using the Huffman table;

– the obtained values are then encoded using the Run Length encoding method.
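The first three steps and a simplified run-length stage can be sketched as follows. This is only an illustrative model, not the CARLOS implementation: the real Huffman stage is omitted, the handling of the first sample and the use of the absolute difference in the tolerance test are our assumptions.

```python
def compress_1d(samples, threshold, tolerance, disable):
    """Sketch of the first three steps of the 1D algorithm."""
    # Step 1: zero suppression - samples below threshold are put to 0.
    zs = [s if s >= threshold else 0 for s in samples]
    # Step 2: differences between consecutive samples (time direction);
    # the first sample is taken as-is (an assumption of this sketch).
    diffs = [zs[0]] + [zs[i] - zs[i - 1] for i in range(1, len(zs))]
    # Step 3: small differences are zeroed, unless the sample exceeds
    # the disable level (cluster cores are kept untouched).
    return [0 if abs(d) < tolerance and s < disable else d
            for s, d in zip(zs, diffs)]

def run_length_zeros(values):
    """Simplified step 5: encode each run of zeros as ('Z', length)."""
    encoded, zeros = [], 0
    for v in values:
        if v == 0:
            zeros += 1
        else:
            if zeros:
                encoded.append(('Z', zeros))
                zeros = 0
            encoded.append(v)
    if zeros:
        encoded.append(('Z', zeros))
    return encoded

stream = compress_1d([5, 30, 32, 31, 6], threshold=20, tolerance=3,
                     disable=100)
print(stream)                    # -> [0, 30, 0, 0, -31]
print(run_length_zeros(stream))  # -> [('Z', 1), 30, ('Z', 2), -31]
```

The long zero runs produced by the threshold and tolerance steps are exactly what makes the final run-length stage effective.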
Figure 3.2: Threshold, tolerance and disable parameters
The high probability of finding long zero sequences in the SDD charge distribution makes Run Length encoding very effective, especially when combined with the threshold, tolerance and disable mechanisms.
3.3 1D algorithm performances

As explained in Chapter 1, in order to comply with the target figures of DAQ speed and magnetic tape usage, the size of the SDD event has to be reduced from 32.5 MBytes to about 1.5 MBytes, which corresponds to a target compression coefficient of 22. Several standard compression algorithms have been evaluated on SDD test beam data in order to estimate the achievable compression performance: the best compression coefficient was obtained with the gzip utility of the Unix operating system, therefore it was chosen for comparison with our 1D algorithm. The data was submitted to the gzip program in binary format for a fair comparison.
Figure 3.3: 1D compression algorithm: software-tunable parameters (threshold, tolerance, Huffman tables) and processing chain (simple threshold zero suppression, differential encoding with tolerance, Huffman encoding, run length encoding)
3.3.1 Compression coefficient

For this comparison, data coming from the August 1998 test beam was chosen. The gzip compression algorithm achieves a compression ratio around 2: this value is too far from our target of 22. The 1D compression algorithm has been applied using a threshold value of 20 (corresponding to noise mean + 1.35 × noise RMS) and tolerance = 0: the compression coefficient obtained is around 12.5. This is still an unacceptable value for our purposes. The target compression value of 22 can only be reached by increasing the threshold parameter, which implies a larger information loss. For instance, by applying the algorithm to the same test beam data it is possible to obtain a compression coefficient of about 33, with threshold = 40 (noise mean + 2.68 × noise RMS) and tolerance = 0. Fig. 3.4 shows the variation of the compression coefficient of the 1D algorithm as a function of the threshold level between 20 and 40, for two values of tolerance.
59
60<br />
1D <strong>compression</strong> algorithm and <strong>implementation</strong>s<br />
Figure 3.4: 1D <strong>compression</strong> ratio as a function <strong>of</strong> threshold and tolerance<br />
An important feature of this compression algorithm is that it can be reverted to a lossless algorithm simply by setting the values of threshold and tolerance to 0. Sending data without losing any information will be very useful for the first event acquisitions, since the raw data will be analyzed to determine statistics, noise and so on. These raw data will also be used for determining the best Huffman tables, i.e. the ones allowing the best compression coefficient to be obtained. When used in lossless mode, meaning that only differential encoding, Huffman and run length encoding are applied, the compression coefficient obtained is 2.3, which is even better than what we obtain with the gzip algorithm.
3.3.2 Reconstruction error<br />
So far it was to be checked if the information loss introduced with a<br />
threshold level <strong>of</strong> 40 is acceptable or not. In particular it was decided to<br />
study how much <strong>data</strong> <strong>compression</strong> and de<strong>compression</strong> affected clusters<br />
geometry for what concerns centroid position and charge.<br />
A cluster finding routine was developed with the following two step<br />
procedure:
Figure 3.5: Spreads introduced by data compression on the measurement of the coordinates of the SDD clusters and of the cluster charge (bottom right)
– data streams are analyzed one anode row after the other: when a sample value is higher than a certain threshold level for two consecutive time bins, it is considered to be a hit until it goes below the same threshold for two consecutive time bins;

– then, if any two 1-D hits from adjacent anodes overlap in time, they are considered as parts of a two-dimensional cluster.
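The first step of the procedure above can be sketched as follows. The boundary conventions (where exactly a hit starts and ends, strict vs. non-strict comparisons) are our assumptions, not a reproduction of the original routine:

```python
def find_hits(anode_samples, threshold):
    """1-D hit finder sketch: a hit starts when the signal is above
    threshold for two consecutive time bins and ends when it is below
    threshold for two consecutive time bins.  Returns a list of
    (first_bin, last_bin) intervals of above-threshold samples."""
    hits, start, below = [], None, 0
    for t in range(1, len(anode_samples)):
        if start is None:
            # Hit opens on two consecutive above-threshold bins.
            if (anode_samples[t] > threshold and
                    anode_samples[t - 1] > threshold):
                start = t - 1
                below = 0
        else:
            if anode_samples[t] <= threshold:
                below += 1
                if below == 2:
                    # Hit closes; last above-threshold bin was t - 2.
                    hits.append((start, t - 2))
                    start, below = None, 0
            else:
                below = 0
    if start is not None:  # hit still open at the end of the row
        hits.append((start, len(anode_samples) - 1))
    return hits

print(find_hits([0, 5, 5, 5, 0, 0, 0], threshold=3))  # -> [(1, 3)]
```

Overlapping (start, end) intervals from adjacent anodes would then be merged into two-dimensional clusters, as described in the second step.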
Once the samples belonging to a cluster have been found, they are fitted with a two-dimensional Gaussian function with the following features:
– the mean value corresponds to the cluster centroid;
– the sigma value corresponds to the centroid resolution;
– the volume under the Gaussian function corresponds to the charge released on the detector by the ionizing particle.
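For reference, the fit model just described can be written down directly; the function names below are ours, and the volume expression is the standard integral of a two-dimensional Gaussian:

```python
import math

def gaussian2d(a, x, y, x0, y0, sx, sy):
    """2-D Gaussian used to fit a cluster: amplitude a, centroid (x0, y0),
    widths (sx, sy). Sketch of the fit model, not the fitting code."""
    return a * math.exp(-0.5 * (((x - x0) / sx) ** 2 + ((y - y0) / sy) ** 2))

def gaussian2d_volume(a, sx, sy):
    """Volume under the 2-D Gaussian, i.e. the cluster charge estimate:
    the integral of a*exp(...) over the plane is 2*pi*a*sx*sy."""
    return 2.0 * math.pi * a * sx * sy
```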
1D compression algorithm and implementations
The 1D compression and decompression algorithms were then applied to test beam data, and cluster finding and analysis were performed on both the original and the reconstructed data: the results are shown in Fig. 3.5. The picture on the upper left shows the distribution of the differences in the centroid coordinates before and after compression along the anode and drift time directions. The picture on the upper right shows the same distribution along the drift time direction, while the picture on the bottom left shows the distribution along the anode direction. These plots show that the compression algorithm with a threshold of 40 does not introduce biases in the centroid coordinate measurements, but that it worsens their accuracy by about 9 µm (+4%) along the anode direction and by about 16 µm (+8%) along the drift time axis. The bottom right picture shows the percentage difference of the charge before and after compression: the 1D algorithm also introduces an underestimation of the cluster charge of about 4%.
3.4 CARLOS v1
During 1999 I collaborated with the INFN group in Torino on the design and test of a first hardware implementation of the 1D algorithm: CARLOS v1. This device is physically implemented as a PCB (Printed Circuit Board) containing two FPGA (Field Programmable Gate Array) circuits and some connectors for use in a test beam data acquisition system, as shown in Fig. 3.6. The device processes data coming from one macrochannel only, that is data coming from one half-detector, and interfaces directly to the SIU board, the first stage of the DAQ system.
3.4.1 Board description
The two main processing blocks mounted on the board are the two Xilinx FPGA devices. An FPGA is a fully programmable device widely used for fast prototyping before the final implementation of a design on an ASIC, which requires far more resources in terms of time, money and design effort.
Figure 3.6: CARLOS prototype v1 picture
An FPGA contains a matrix of CLBs (Configurable Logic Blocks) that can be individually programmed and connected together in order to implement the desired input/output logic function. Each CLB contains an SRAM (Static RAM) used as a look-up table: a logic function is implemented by presenting the input values on its address bus.
Another area of the FPGA die contains the configuration RAM: depending on the contents of this block, the device performs different logic functions. The configuration RAM is written at power-on from an external EPROM: CARLOS v1 hosts two EPROM devices, one for each FPGA. The configuration process takes around 20 ms, after which the devices are fully operational. A 10 MHz clock generator is hosted between the EPROM chips: we could not achieve a higher working frequency with our choice of FPGA device. In fact the final operating frequency
Features Values
Logic cells 2432
Max logic gates (no RAM) 25k
Max RAM bits (no logic) 32768
Typical gate range (logic and RAM) 15k - 45k
CLB matrix 32x32
Total CLBs 1024
Number of flip-flops 2560
Number of user I/O 256
Table 3.1: XC4025 Xilinx FPGA main features
is a function of how many internal resources are used: the more resources are used, the lower the final working frequency becomes. With the final 10 MHz frequency we reached a good trade-off between logic complexity and speed; furthermore, this frequency was sufficient for operation in a test-beam environment. Tab. 3.1 reports the main features of the chosen FPGA device, the XC4025E-4 HQ240C.
The board also contains three connectors, from left to right:
– the first is used for data injection into the first FPGA device, using a Hewlett Packard (HP) pattern generator;
– the second one is used to analyze the data coming out of the first device by means of a logic analyzer probe;
– the third connector is used for the communication between CARLOS v1 and the SIU board. Fig. 3.7 shows a picture of the final SIU board. We used a simplified SIU version called SIMU (SIU simulator), distributed by CERN to help front-end designers build DAQ-compatible devices. The SIMU board can be plugged directly onto this connector.
Figure 3.7: Picture of the SIU board
3.4.2 CARLOS v1 design flow
I carried out the design of the second FPGA device following the digital design flow shown in Fig. 3.8. In particular, the design flow is composed of the following steps:
– block specifications were coded in the VHDL language using a hierarchical structure, starting from the bottom layer up to the top level;
– each VHDL model was simulated with the Synopsys simulator software in order to debug the code;
– each VHDL model was synthesized, that is, translated into a netlist, using the Synopsys synthesis tool; the netlist contains the usual standard cells such as AND, OR or flip-flops, but the FPGA device does not contain these elements, only RAM blocks: the netlist is merely a logic representation of the circuit, with no physical meaning;
– the netlist was simulated using the Synopsys simulator, taking cell timing delays and constraints into account;
– the netlist was automatically converted into a physical layout using
the place and route software Alliance from Xilinx;
– the layout information was written into a binary file, ready to be downloaded into the EPROM chip using the Alliance software together with an EPROM programmer.
Figure 3.8: Digital design flow for CARLOS v1
This is a very straightforward and automated process; moreover, the time between a slight modification of the VHDL code and its actual implementation in the FPGA device is very short. This is the main reason why FPGAs are so widely used for prototyping. Another important reason is the following: running millions of test vectors as a software simulation of a VHDL model is a very long process even on fast machines, while the same set of test vectors can be run in a few seconds on the hardware prototype. An FPGA implementation thus easily allows algorithm verification on a huge amount of data.
3.4.3 Functions performed by CARLOS v1
The FPGA on the left in Fig. 3.6 contains the 1D compression algorithm explained in the previous sections, composed of 5 processing blocks applied sequentially to the input data. The blocks form a 5-level pipeline chain, each level requiring one clock cycle. The variable-length compressed codes are produced as 32-bit words.
The FPGA on the right contains the following blocks:
– firstcheck: this block processes the 32-bit input words coming from the compressor FPGA: if the MSB is high the incoming data is rejected, otherwise it is accepted and split into two different data words, a 26-bit one containing the variable-length code and a 5-bit one containing the number of bits to be stored;
– barrel: this block packs variable-length codes of 2 to 26 bits into fixed-size 32-bit words. The number of bits to store, from 2 to 26, is given by the 5-bit length bus coming from the firstcheck block. Variable-length Huffman codes packed into 32-bit words can be uniquely unpacked using the Huffman table, proceeding from MSB to LSB. When a word is complete, an output-push signal is asserted;
– fifo: it contains a 64x32-bit RAM for storing the data coming out of the barrel shifter. When the FIFO contains at least 16 data words, it asserts a query signal asking the feesiu block to begin popping data;
– feesiu: this is the most complex block of the prototype, containing the interface between CARLOS and the SIU board. The main behavior is quite simple: CARLOS waits for a “Ready to Receive” (RDYRX) command from the SIU on a bidirectional data bus; after receiving it, CARLOS takes possession of the bidirectional bus and begins sending data towards the SIU as packets of 17 32-bit words. Each packet consists of a header word, containing externally hardwired information, and 16 data words coming out of the FIFO. When the FIFO is empty or does not contain 16 data words, no valid data is sent to the SIU. Conversely, if a FIFO begins to accumulate large quantities of data while the connection to the SIU is not yet open (a RDYRX command has not been received yet), a data-stop signal is asserted to stop the data stream coming into CARLOS from AMBRA.
3.4.4 Tests performed on CARLOS v1
The CARLOS prototype was tested using the HP16700A pattern generator and logic analyzer at the INFN Section in Torino. Data were injected on the first connector and analyzed on the second, while the third one was connected to a SIU extender board, which connects directly to the SIMU board. The SIU extender is very useful for debugging purposes, since it provides 5 logic-analyzer-compatible connectors for probing the signals exchanged on the CARLOS-SIU interface. Here follows a list of the tests performed on CARLOS:
1. functional test and compression algorithm verification;
2. opening of a transaction by manually pushing buttons on the SIMU board;
3. event data transmission from CARLOS to the SIMU. The SIMU does not store data, so the only way to check whether the data are correct or not is by using the logic analyzer.
The prototype test was especially useful for designing a fully compatible interface towards the SIU. The main difficulty in testing the interface towards the SIU without a SIU board is due to the presence of bidirectional pads: working with such pads using a pattern generator is quite difficult.
Many corrections had to be applied to the original version in order to have a 100% compatible interface. The final VHDL version was then frozen and used for the ASIC implementation of CARLOS v2. The VHDL model, in fact, does not depend on the technology chosen for the implementation and is completely reusable.
3.5 CARLOS v2
The first CARLOS prototype was very useful for testing the compression algorithm on a huge amount of data and for correctly designing complex blocks such as the interface towards the SIU, but it has many limitations compared to the final version we need to design. We therefore decided to move to a second prototype of CARLOS with the following features:
– 40 MHz clock frequency;
– parallel processing of 8 macro-channels;
– small size, for easier use in a test-beam environment;
– a JTAG port for downloading the Huffman look-up tables, the threshold and the tolerance values.
The CARLOS chip design has been logically divided into two main parts, the first designed in Torino and the second in Bologna:
– a data compressor working on the 8 incoming streams, using the 1D compression algorithm. The compressor accepts 8-bit input data and outputs 32-bit words containing the variable-length codes;
– a data packing and formatting block, a multiplexer selecting which of the 8 incoming streams has to be sent in output, and an interface block towards the SIU.
As can be seen in Fig. 3.9, the main sub-blocks are six: firstcheck, barrel, fifo, event-counter, outmux and feesiu.
Figure 3.9: CARLOS v2 schematic blocks
3.5.1 The firstcheck block
The I/O signals are:
– inputdata: input 32-bit bus;
– ck: input signal;
– reset: input signal;
– load: output signal;
– addressvalid: output 5-bit bus;
– datavalid: output 26-bit bus.
The firstcheck block takes as input the compressed codes coming from the compression block and selects the useful bits while rejecting the dummy ones. In fact the 32-bit input word has the following structure:
– bit 31: under-run bit: when set to 1 it means that the incoming data are dummy and have to be discarded; this may happen, for example, when the run length encoder is packing long sequences of zeros, thus temporarily interrupting the data flow towards the SIU;
– bits 30 to 26: this 5-bit field contains the actual number of bits that have to be selected by the following logic block, the barrel shifter;
– bits 25 to 0: this 26-bit field contains the compressed code.
The bits actually carrying information are usually far fewer than 26, which yields a reduction in the data stream volume.
The firstcheck behavior is quite simple: when the reset signal is active (active high), all outputs are set to 0; when reset is inactive, the firstcheck block samples the under-run bit: when it is 1, all outputs are set to 0; when it is 0, load is set to 1, addressvalid is assigned inputdata(30 downto 26) and datavalid is assigned inputdata(25 downto 0).
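The behaviour just described can be condensed into a short software model (a sketch of the block's function, not the VHDL itself):

```python
def firstcheck(inputdata):
    """Split a 32-bit compressor word into (load, addressvalid, datavalid).
    Bit 31 is the under-run flag, bits 30..26 the code length,
    bits 25..0 the variable-length code."""
    underrun = (inputdata >> 31) & 0x1
    if underrun:                              # dummy word: all outputs forced to 0
        return 0, 0, 0
    load = 1
    addressvalid = (inputdata >> 26) & 0x1F   # number of valid bits (2..26)
    datavalid = inputdata & 0x3FFFFFF         # 26-bit compressed code
    return load, addressvalid, datavalid
```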
3.5.2 The barrel shifter block
The I/O signals are:
– input: input 26-bit bus;
– sel: input 5-bit bus;
– load: input signal;
– ck: input signal;
– reset: input signal;
– end-trace: input signal;
– output-push: output signal;
– output: output 32-bit bus.
The barrel shifter has to pack all the valid data coming out of the firstcheck block into fixed-length 32-bit output words: in this way all dummy data are rejected and there is no longer any distinction between data length and the data itself. All codes are packed into the same word stream and can easily be reconstructed using the Huffman tree decoding scheme. If an input code cannot be completely stored in a 32-bit word, it is broken into two pieces: the first completes the current output word, the second is carried over into the following valid output word.
When the reset is active, all internal registers and outputs are set to 0; when the reset is inactive, the barrel shifter waits for valid data coming from the firstcheck block, that is data with the load signal set to 1. When this happens, the barrel shifter selects the valid bits from input and packs them together in a 64-bit circular register. When 32 bits have been written to the register, the block asserts the output-push signal high to tell the following block (the FIFO) that the output is valid and has to be stored.
Two conditions are important for the barrel shifter to work properly: when the load signal changes from 1 to 0 the barrel stops packing data, and when load returns to 1 the barrel resumes packing data as if no pause had occurred.
The end-trace signal is asserted for one clock period in coincidence with the last valid data word: this word has to be packed together with the others, and the resulting 32-bit word has to be pushed in output (by setting output-push to 1) even if it is not complete. After the end-trace, once the last valid word has been sent in output, the barrel shifter emits n zero words as valid outputs: n depends on how many words have been sent in output since the beginning of the current event, because the total number of valid words per event has to be an integer multiple of 16. Thus, if (16k + 7) words have been sent in output when end-trace becomes active, n = 9 zero words follow with output-push set to 1. This condition is strictly related to the data transmission policy and to the multiplexing of the 8 incoming data streams onto a single 32-bit output, as will be explained in the next paragraph.
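The packing and padding policy described above can be modelled as follows (a software sketch: codes are assumed to arrive as (value, nbits) pairs, packed MSB-first as implied by the Huffman decoding description):

```python
def barrel_pack(codes, pad_to=16):
    """Pack (value, nbits) variable-length codes MSB-first into 32-bit
    words; after the last code (end-trace), flush the partial word and
    append zero words so the total count is a multiple of pad_to."""
    words, acc, filled = [], 0, 0
    for value, nbits in codes:
        acc = (acc << nbits) | (value & ((1 << nbits) - 1))
        filled += nbits
        while filled >= 32:              # a 32-bit word is complete: push it
            filled -= 32
            words.append((acc >> filled) & 0xFFFFFFFF)
            acc &= (1 << filled) - 1     # keep only the leftover bits
    if filled:                           # end-trace: flush the incomplete word
        words.append((acc << (32 - filled)) & 0xFFFFFFFF)
    while len(words) % pad_to:           # e.g. 16k + 7 words -> 9 zero words
        words.append(0)
    return words
```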
3.5.3 The fifo block
The I/O signals are:
– datain: input 32-bit bus;
– ck: input signal;
– push: input signal;
– pop: input signal;
– reset: input signal;
– empty: output signal;
– full: output signal;
– query: output signal;
– dataout: output 32-bit bus.
The fifo block contains a dual-port RAM of 64 32-bit words plus some control logic. Its purpose is to buffer the input data stream and derandomize the queues waiting to be served by the outmux block. The buffer memory has to be large enough to allow data storage while the other queues are being served, since blocking conditions must be avoided. On the other hand, it cannot be too large, since CARLOS hosts 8 fifo blocks and chip area is a strong design constraint.
The fifo allows 3 main storage operations:
– write only;
– read only;
– read/write at the same time, but at different cell locations.
The FIFO allows writing the data coming from the barrel shifter and reading them when the queue has to be served by the outmux block. The most important feature is that read and write operations can be executed in parallel. To achieve this, the control logic provides two pointers named address-write and address-read. They run from 0 to 63 and then back to 0 in a circular way: obviously address-read always has to follow address-write, otherwise invalid data would be extracted from the memory. Data is written into the fifo and the address-write pointer is incremented by one when the push input is set to 1: the push input of the fifo is the same signal as the output-push one from the barrel. In this way, when the barrel shifter has a valid output, it is written into a free location of the fifo at the next clock cycle.
The RAM read phase is activated by the pop input signal: for every clock cycle in which pop is 1, the data value pointed to by address-read is presented on the dataout output and the address-read pointer is incremented by 1. When both push and pop are set to 1, the fifo is read and written at the same time and the distance between the two pointers remains constant. Three important signals are:
– query signal: the query signal is set to 1 when the memory contains at least 16 valid data words, that is when the distance between the two pointers is greater than or equal to 16. The query signal is used by the outmux block, where a priority-encoding-based arbiter decides which of the 8 queues has to be served in output. When a fifo block is served by the outmux, the number of total valid words decreases and the query signal comes back to 0. The query signal can remain at 1 if more than 32 valid words were stored in the fifo; in this case the fifo may be read again. Everything depends on how many queues are sending requests to the scheduler to be emptied.
– empty signal: the empty signal is set to 1 when the fifo does not contain any valid data, that is when address-write and address-read have the same value and point to the same memory location. This signal is used by the feesiu block to decide when all 8 queues have been completely emptied and a new data set can enter CARLOS.
– full signal: the full signal is very important since it is back-propagated to the compressor block to signal that the FIFO is getting full and the input stream has to be stopped. The compressor block back-propagates this full signal to the AMBRA chip, which stops sending data to CARLOS. Obviously the full signal has to be asserted before the FIFO is really full, otherwise some input data would be lost. For this reason the fifo full signal works between two thresholds, 32 and 48: the full signal goes high when the fifo contains more than 48 valid words, then it comes back to 0 only when the fifo has been served by the outmux block, that is when the fifo contains fewer than 32 valid words. With this trick the risk of the fifo getting completely full is reduced, at least if the queue arbiter is fair enough to every input stream.
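The pointer arithmetic and the three flags can be summarised in a small software model (a sketch using the thresholds quoted in the text; the real block is a dual-port RAM with control logic, not Python):

```python
class FifoModel:
    """Software model of the 64-word fifo block: circular read/write
    pointers plus the query, empty and hysteresis-based full flags."""
    DEPTH = 64

    def __init__(self):
        self.mem = [0] * self.DEPTH
        self.wr = self.rd = 0      # address-write / address-read pointers
        self.count = 0             # distance between the two pointers
        self.full = False

    def clock(self, push=False, datain=0, pop=False):
        dataout = None
        if push and self.count < self.DEPTH:
            self.mem[self.wr] = datain
            self.wr = (self.wr + 1) % self.DEPTH
            self.count += 1
        if pop and self.count > 0:
            dataout = self.mem[self.rd]
            self.rd = (self.rd + 1) % self.DEPTH
            self.count -= 1
        # full goes high above 48 stored words and back to 0 only below 32
        if self.count > 48:
            self.full = True
        elif self.count < 32:
            self.full = False
        return dataout

    @property
    def query(self):
        return self.count >= 16    # at least 16 words: a packet is ready

    @property
    def empty(self):
        return self.count == 0
```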
3.5.4 The event-counter block
The I/O signals are:
– end-trace: input signal;
– ck: input signal;
– reset: input signal;
– event-id: output 3-bit bus.
The event-counter block is a very simple 3-bit binary counter used to assign a number to every physical event, so that consecutive events can easily be discriminated. When the reset is active, internal registers and outputs are set to 0; when the reset is inactive, the event-counter block increments its output signal event-id by one every time it samples the end-trace signal at logic level 1. The end-trace feeding the event-counter block is a signal coming from the feesiu block called all-fifos-empty. This signal is asserted for two clock periods when all 8 end-trace signals have been set to 1 and all 8 queues have been completely emptied. For this purpose CARLOS contains a global end-trace signal, which is activated when all 8 local end-traces have been high for at least one clock period; a temporal overlap between the 8 signals is not strictly necessary. Note, however, that the global end-trace will never go to 1 if some of the local end-traces are not used and remain stuck at 0. After the global end-trace is activated, the feesiu block waits for the 8 FIFOs to be emptied: as soon as this happens, the all-fifos-empty signal is activated and the event-id signal is incremented by one. The all-fifos-empty signal stays at logic level 1 for two consecutive clock periods; nevertheless, the event-id counter is incremented only by 1. The value of event-id is used in the outmux block and is sent to the SIU as part of the header word. We judged 3 bits sufficient to discriminate the events and to put them back in the right order during the data decompression and reconstruction stages.
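A minimal software model of this behaviour, in particular the single increment even when all-fifos-empty stays high for two clock periods (an edge-detection sketch, not the RTL):

```python
class EventCounter:
    """3-bit cyclic event counter: event-id increments on each rising
    edge of the all-fifos-empty input."""
    def __init__(self):
        self.event_id = 0
        self.prev = 0

    def clock(self, all_fifos_empty):
        # increment once per assertion, even if the input stays high
        # for two consecutive clock periods as described in the text
        if all_fifos_empty and not self.prev:
            self.event_id = (self.event_id + 1) % 8   # 3-bit wrap-around
        self.prev = all_fifos_empty
        return self.event_id
```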
3.5.5 The outmux block
The I/O signals are:
– indat7: input 32-bit bus;
– indat6: input 32-bit bus;
– indat5: input 32-bit bus;
– indat4: input 32-bit bus;
– indat3: input 32-bit bus;
– indat2: input 32-bit bus;
– indat1: input 32-bit bus;
– indat0: input 32-bit bus;
– reset: input signal;
– ck: input signal;
– query: input 8-bit bus;
– event-id: input 3-bit bus;
– enable-read: input signal;
– half-ladder-id: input 7-bit bus;
– good-data: output signal;
– read: output 8-bit bus;
– output: output 32-bit bus.
The outmux block has two distinct functions in the overall logic:
– multiplexing the 8 compressed and packed streams onto a single 32-bit output (femux sub-block);
– deciding which queue has to be served, using a priority-encoding-based arbiter (ppe sub-block).
The femux and ppe blocks implement the following 17-word data packet transmission protocol (see Fig. 3.10):
– a 32-bit header;
– 16 32-bit data words, all coming from one macrochannel and from one event.
Figure 3.10: 17-word data packet transmission protocol
The header contains the following information, from MSB to LSB:
– half ladder id (7 bits): this number is hardwired externally to each CARLOS chip, depending on the ladder it will be connected to;
– packet sequence number (10 bits): a 10-bit counter incremented once a packet is transmitted, i.e. every 17 data words;
– cyclic event number (3 bits): the event number coming from the event-counter block;
– available bits (9 bits): these will be used in a future expansion of CARLOS;
– half detector id (3 bits): every half ladder contains 8 half detectors, numbered from 0 to 7; this number is given by the macro-channel being served.
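As an illustration, the header fields listed above can be packed into a 32-bit word as follows; the field order follows the text, but the exact bit positions are an assumption of this sketch:

```python
def build_header(half_ladder_id, packet_seq, event_id, half_detector_id):
    """Assemble the 32-bit packet header, MSB to LSB: half ladder id (7),
    packet sequence number (10), cyclic event number (3), 9 spare bits,
    half detector id (3). Bit positions are assumed, not from the spec."""
    word = (half_ladder_id & 0x7F) << 25    # bits 31..25
    word |= (packet_seq & 0x3FF) << 15      # bits 24..15
    word |= (event_id & 0x7) << 12          # bits 14..12
    # bits 11..3 are the 9 spare "available" bits, left at 0
    word |= half_detector_id & 0x7          # bits 2..0
    return word
```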
Let’s take a look at the two sub-blocks of the outmux:
3.5 — CARLOS v2<br />
– femux is a multiplexer with nine 32-bit inputs and a 9-bit selection bus. The nine data inputs are the header and the 8 input channels coming from the FIFOs. The selection bus value is provided by the queue scheduler: this bus contains all zeros except a single 1.
– ppe stands for programmable priority encoder. It is a completely combinatorial block with two inputs and one output. The inputs are request (8 bits), which contains the query signals coming from the 8 macro-channels, and priority (8 bits), a bus containing exactly one bit at 1 and all the others at 0. The output, served (8 bits), like priority, contains exactly one bit at logic level 1, indicating which of the 8 macro-channels has to be served by the femux.
The programmable priority encoder works in a very simple way: it scans the request bus, starting from the bit set to 1 in the priority bus, until it finds a 1. That bit position, from 0 to 7, corresponds to the channel chosen by the arbiter. At the arbiter's next decision, the priority bus value is updated as follows: the served bus value is shifted right as if it were a circular register and the result is assigned to the priority bus. In this way we avoid the risk of one queue being served many times consecutively while other queues are making requests. An example easily clarifies this: with request = 10100010 and priority = 00010000, served = 00000010; at the next clock cycle the value 00000001 will be assigned to the priority bus. There are several possible implementations of a scheduling algorithm based on a programmable priority encoder, differing in area and timing requirements. We chose the implementation used in the Stanford University’s Tiny Tera prototype, as described in [18].
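A software model of this arbitration scheme, reproducing the example above (the downward, circular scan direction is inferred from that example, not stated explicitly in the text):

```python
def ppe(request, priority):
    """Programmable priority encoder: scan the 8-bit request bus
    circularly, starting from the single bit set in priority, and
    return the one-hot served value (0 if no request is pending)."""
    start = priority.bit_length() - 1        # index of the 1 in priority
    for i in range(8):
        bit = (start - i) % 8                # scan downwards, circularly
        if request & (1 << bit):
            return 1 << bit                  # one-hot served bus
    return 0

def next_priority(served):
    """After a grant, rotate served right by one (circularly) to form
    the priority bus for the arbiter's next decision."""
    return ((served >> 1) | ((served & 1) << 7)) & 0xFF
```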
The outmux block works as follows: it is stopped and initialized while the reset signal is active. When the reset is released, the outmux block waits for the enable-read signal to become active. This is a signal coming from the feesiu block: when low, it states either that the link between the SIU and CARLOS has not been initialized yet or that the SIU temporarily cannot accept data. When enable-read is high, the SIU is able to receive data from CARLOS, so the outmux block starts evaluating the value of the query bus. When this value is low, no macro-channel has yet requested to be served; otherwise the ppe block decides which queue to send in output. The first word served in output is the header word, containing the identifier of the macro-channel being served and the other information described earlier in this paragraph. In order to obtain the 16 data words to send in output, the outmux block has to provide the proper pop signal to one of the 8 FIFOs. The 8 pop signals to the FIFOs are grouped in the 8-bit read bus; of course only one bit at a time is asserted. Signal read(7) is sent to fifonew7, read(6) to fifonew6 and so on, so as to extract 16 valid data words from the FIFO. Since data have to be sent to the SIU at a 20 MHz clock (half the system clock frequency), the pop signal cannot be stuck at 1 for 16 clock periods: it alternates between 0 and 1, so that a data word is popped from the FIFO one clock period out of every two. While the outmux block is putting in output the 17 words of a packet, the output signal good-data is set to 1 in order to assure the feesiu block that it is receiving significant data. While sending the last data word of a packet, the outmux block updates the priority bus value as stated above, examines the query bus value, and then computes the new served value. If served is not 0, i.e. if any request has occurred, the outmux block starts sending another packet in output without any interruption (no clock periods are wasted); otherwise the block stops, waiting for a new request to be asserted. If enable-read turns from 1 to 0 during a transmission, the outmux block sends only one more valid word in output, then stops and waits for the enable-read signal to be restored to its active value: it then resumes sending data to the feesiu block as if no pause had occurred. The outmux block itself increments the 10-bit packet sequence number after every packet has been completely transmitted.
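As a toy model of the pop timing, the alternating pattern can be sketched as follows (illustrative only; in CARLOS the signal is generated by the VHDL control logic):

```python
def pop_sequence(n_words: int = 16):
    """One entry per 40 MHz system-clock cycle: the pop signal is
    asserted every other cycle, so one word leaves the FIFO every
    two system-clock periods, matching the 20 MHz output rate."""
    seq = []
    for _ in range(n_words):
        seq.extend([1, 0])  # pop asserted, then one idle cycle
    return seq
```

For the 16 data words of a packet this gives 32 system-clock cycles with exactly 16 pops.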
The choice of a 20 MHz clock is related to the total optical fibre bandwidth available to CARLOS: 800 Mbit/s. If CARLOS put out 32-bit data at 40 MHz, the required bandwidth would be 1.28 Gbit/s, while at 20 MHz it is only 640 Mbit/s. For this reason the half-frequency data rate was chosen as the final one.
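The arithmetic behind this choice can be written down explicitly:

```python
WORD_BITS = 32  # width of the fbd output bus

def bandwidth_mbit_s(clock_mhz):
    """Bandwidth in Mbit/s when a 32-bit word is sent every clock cycle."""
    return WORD_BITS * clock_mhz

FIBRE_BUDGET = 800  # Mbit/s available on the optical fibre

# at the full 40 MHz system clock the fibre budget is exceeded,
# at 20 MHz the output stream fits with margin
assert bandwidth_mbit_s(40) > FIBRE_BUDGET
assert bandwidth_mbit_s(20) < FIBRE_BUDGET
```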
3.5.6 The feesiu (toplevel) block
The I/O signals are:
– huffman7: input 32-bit bus;
– huffman6: input 32-bit bus;
– huffman5: input 32-bit bus;
– huffman4: input 32-bit bus;
– huffman3: input 32-bit bus;
– huffman2: input 32-bit bus;
– huffman1: input 32-bit bus;
– huffman0: input 32-bit bus;
– ck: input signal;
– reset: input signal;
– end-trace7: input signal;
– end-trace6: input signal;
– end-trace5: input signal;
– end-trace4: input signal;
– end-trace3: input signal;
– end-trace2: input signal;
– end-trace1: input signal;
– end-trace0: input signal;
– fidir: input signal;
– fiben-n: input signal;
– filf-n: input signal;
– half-ladder-id: input 7-bit bus;
– wait-request7: output signal;
– wait-request6: output signal;
– wait-request5: output signal;
– wait-request4: output signal;
– wait-request3: output signal;
– wait-request2: output signal;
– wait-request1: output signal;
– wait-request0: output signal;
– foclk: output signal;
– fbten-n: bidirectional signal;
– fbctrl-n: bidirectional signal;
– fobsy-n: output signal;
– fbd: bidirectional 32-bit bus.
The VHDL feesiu block contains all the other block instances (see Fig. 3.11) and the logic acting as interface with the SIU board. The feesiu block thus contains 8 instances of firstcheck, 8 instances of barrel, 8 instances of fifo, 1 instance of event-counter and 1 instance of outmux. However, we can think of the feesiu block as the block taking data from the outmux block and directly interfacing the SIU board, as if it were at the same hierarchical level as the other blocks. In Fig. 3.9 the feesiu block is represented exactly in this fashion.
3.5.7 CARLOS-SIU interface
Let us now take a look at the interface signals between CARLOS and the SIU and at how the communication protocol has been implemented:
Figure 3.11: Design hierarchy of CARLOS v1
– fidir: it is an input to CARLOS. It sets the direction of the data flow between CARLOS and the SIU: when low, data flow from the SIU to CARLOS; otherwise data flow from CARLOS to the SIU.
– fiben-n: it is an input to CARLOS, active low. It enables the communication on the bidirectional buses between CARLOS and the SIU: when low, communication is enabled; otherwise it is disabled.
– filf-n: it is an input to CARLOS, active low; "lf" stands for link full. When the SIU is no longer able to accept data coming from CARLOS, it activates this signal. When this happens, CARLOS sends one more valid data word, then stops transmitting and waits for the filf-n signal to be released. This is the signal used by the SIU to implement back-pressure on the data flow running from the front-end to the data acquisition system.
– foclk: it is a free-running clock generated on CARLOS and driving the CARLOS-SIU interface. It is a 20 MHz clock obtained by dividing the system clock frequency by 2. Interface signals coming from the SIU are sampled on the falling edge of foclk.
– fbten-n: it is a bidirectional signal, active low, which can be driven either by CARLOS or by the SIU; "ten" stands for transfer enable. When CARLOS drives the bidirectional buses (fidir is 1 and fiben-n is 0), the fbten-n value is asserted by CARLOS: it is active while CARLOS is transmitting valid data to the SIU and inactive otherwise. When the SIU drives the bidirectional buses (fidir is 0 and fiben-n is 0), the fbten-n value is asserted by the SIU: it is active while the SIU is transmitting valid commands to CARLOS and inactive otherwise.
– fbctrl-n: it is a bidirectional signal, active low, which can be driven either by CARLOS or by the SIU; "ctrl" stands for control. When CARLOS drives the bidirectional buses (fidir is 1 and fiben-n is 0), the fbctrl-n value is asserted by CARLOS: it is active while CARLOS is transmitting a Front End Status Word to the SIU, while in the inactive state CARLOS is sending normal data to the SIU. When the SIU drives the bidirectional buses (fidir is 0 and fiben-n is 0), the fbctrl-n value is asserted by the SIU: it is active when sending command words to CARLOS and inactive when sending data words. This second option has not been implemented on CARLOS, since we decided that CARLOS needs only commands, and not data, from the SIU. Other detectors use this option in order to download data to the detector itself: this is the case, for example, of the Silicon Pixel Detector.
– fobsy-n: it is an output of CARLOS and an input to the SIU, active low; "bsy" stands for busy. CARLOS should activate this signal when it is not able to accept data coming from the SIU. Since CARLOS does not have to receive data from the SIU, this signal has been stuck at 1, meaning that CARLOS is never in a busy state: in fact it always has to accept command words coming from the SIU.
– fbd: it is a bidirectional 32-bit bus on which data or command words are exchanged between CARLOS and the SIU.
The communication protocol works as follows: the SIU acts as the master and CARLOS as the slave, i.e. the SIU sends commands to CARLOS and CARLOS sends data and Front End Status Words to the SIU. At first the CARLOS-SIU link has to be initialized, and the SIU acts as the master of the bidirectional buses. CARLOS therefore waits for the bidirectional buses to be driven by the SIU (fidir is 0 and fiben-n is 0) and waits for a valid (fbten-n = 0) command (fbctrl-n = 0) named Ready to Receive (RDYRX). This command is always used to begin a new event transaction. The RDYRX command contains a transaction identifier (bits 11 to 8) and the string "00010100" as the least significant bits.
Once the command is accepted and recognized, CARLOS waits for the fidir signal to change value in order to take possession of the bidirectional buses; then, if filf-n is not active, it can send valid data on the fbd bus whenever the good-data signal is active. In this state CARLOS sends valid data of an event to the SIU only while some queues are requesting to be served in output; otherwise the feesiu block stops sending data by putting the fbten-n signal to 1. When an end-trace signal has arrived on each macro-channel and every queue has been completely emptied (no more data of that event are stored in CARLOS), CARLOS puts in output the Front End Status Word (FESTW), a word confirming that no errors occurred and that the whole event has been successfully transferred to the SIU. The FESTW contains the transaction identifier received upon the opening of the transaction (bits 11 to 8) and the 8-bit FESTW code "01100100". After this, CARLOS waits for some action to be taken by the SIU: the SIU can either take back control of the bidirectional buses and close the data link towards the data acquisition system, or leave control of the bidirectional buses to CARLOS so that another event can be sent. CARLOS therefore waits 16 foclk periods: if nothing happens, CARLOS can begin sending data again without receiving any other command from the SIU; if the SIU takes back possession of the bidirectional buses, CARLOS closes the link towards the SIU and waits for another RDYRX command raised by the SIU itself.
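The two protocol words described above can be modelled with a pair of helper functions. Only the transaction identifier (bits 11 to 8) and the low byte are specified in the text; leaving the higher bits of the 32-bit word at 0 is an assumption of this sketch:

```python
RDYRX_CODE = 0b00010100  # "Ready to Receive" command, 8 least significant bits
FESTW_CODE = 0b01100100  # Front End Status Word, 8 least significant bits

def make_word(code: int, transaction_id: int) -> int:
    """Build a protocol word: 4-bit transaction id in bits 11..8,
    8-bit code in bits 7..0, remaining high bits left at 0 (assumed)."""
    assert 0 <= transaction_id <= 0xF
    return (transaction_id << 8) | code

def parse_word(word: int):
    """Return (transaction_id, code) from an fbd protocol word."""
    return (word >> 8) & 0xF, word & 0xFF
```

The FESTW returned by CARLOS must carry the same transaction identifier received in the opening RDYRX, which is easily checked with `parse_word`.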
The feesiu block implements this communication protocol with the SIU using a simple state machine: state 0 is the state in which CARLOS waits for an initialization command from the SIU; in state 1 CARLOS sends data to the SIU; in state 2 CARLOS sends the Front End Status Word to the SIU; in state 3 CARLOS waits 16 foclk periods for some action from the SIU.
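The four states can be sketched as a transition function. The trigger conditions (flag names included) are inferred from the protocol description above, so this is an indicative model rather than the VHDL state machine itself:

```python
# States of the feesiu protocol state machine, numbered as in the text.
WAIT_RDYRX, SEND_DATA, SEND_FESTW, WAIT_16 = range(4)

def next_state(state, rdyrx_received=False, event_done=False,
               siu_took_buses=False, timeout_expired=False):
    """Hypothetical transition function for the feesiu protocol FSM."""
    if state == WAIT_RDYRX:
        # state 0: wait for the initialization command from the SIU
        return SEND_DATA if rdyrx_received else WAIT_RDYRX
    if state == SEND_DATA:
        # state 1: send event data until the whole event has gone out
        return SEND_FESTW if event_done else SEND_DATA
    if state == SEND_FESTW:
        # state 2: emit the Front End Status Word, then wait
        return WAIT_16
    if state == WAIT_16:
        # state 3: wait 16 foclk periods for an action from the SIU
        if siu_took_buses:
            return WAIT_RDYRX   # link closed: wait for a new RDYRX
        if timeout_expired:
            return SEND_DATA    # resume sending without new commands
        return WAIT_16
    raise ValueError(state)
```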
Figure 3.12: Digital design flow for CARLOS v2

An important feature of CARLOS, realized in the feesiu block, is the following: CARLOS cannot accept a new event before the previous one has been completely sent in output, otherwise we run the risk of mixing data belonging to different events. The only way CARLOS can apply back-pressure on the AMBRA chips is through the wait-request signals. The wait-request signal therefore has to prevent CARLOS from fetching new input data values while the FIFOs are being emptied. For this reason a new signal, dont-send-data, has been introduced for every macro-channel: it turns to 1 when end-trace is activated and back to 0 when all the FIFOs are completely empty. The wait-request of every macro-channel is thus obtained by ORing the full and dont-send-data signals. The feesiu block acknowledges that all the FIFOs have been emptied using the empty signal of every FIFO block: when all 8 signals turn to 1, the feesiu block raises the all-fifos-empty signal, which stays at logical level 1 for at least two clock periods so that it can be sensed by the foclk clock. The all-fifos-empty signal is also used to trigger the event-counter block: in fact the total number of events is exactly the total number of occurrences of the all-fifos-empty signal. Another signal, end-trace-global, is set to 1 only if all the local end-trace signals have been put to 1 for at least one clock period in the current event. Between the moment in which end-trace-global is asserted and the moment in which all-fifos-empty is activated, no new input data set can enter CARLOS.
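In boolean terms, the back-pressure logic just described reduces to two small expressions, sketched here in Python:

```python
def wait_request(fifo_full: bool, dont_send_data: bool) -> bool:
    """wait-request of a macro-channel: the OR of the FIFO full flag
    and the dont-send-data flag of that macro-channel."""
    return fifo_full or dont_send_data

def all_fifos_empty(empty_flags) -> bool:
    """all-fifos-empty is raised only when the empty signals
    of all 8 FIFO blocks are at 1."""
    return all(empty_flags)
```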
3.6 CARLOS v2 design flow
Fig. 3.12 illustrates the digital design flow for CARLOS v2. The front-end steps are exactly the same as those followed in the design of CARLOS v1. The only difference is the library used: in this case, the Alcatel Mietec 0.35 µm digital library provided via Europractice. This is a very rich library, since it contains more than 200 different standard cells and RAM blocks of several sizes. A RAM generator software tool allows the designer to obtain a macrocell with exactly the requested number of words and bits per word: in our case a 64-word, 32-bit macrocell, instantiated 8 times, one per macro-channel.

Figure 3.13: Layout of the ASIC CARLOS v2
The back-end steps were carried out at IMEC using the Avant! software Acquarius. We could not obtain a license for this software due to its high cost (more than 100 k$ per license), while no other available software, such as Cadence, was able to work with the design kit provided. The final physical layout is depicted in Fig. 3.13. The chip has a total area of 30 mm² and contains 300 k standard cells, 180 I/O pads and 24 RAM blocks.
After the design <strong>of</strong> the layout, IMEC sent us the post-layout netlist<br />
and a SDF file (Standard Delay Format) containing the information<br />
on each net and cell delay for post-layout simulation with the same<br />
test-benches already used for pre-layout simulation. This is usually an<br />
iterative process since, if some simulation problems arise, the layout<br />
has to be re-designed. Luckily due to the relatively small working frequency<br />
(40 MHz) (the technology adopted can easily work up to 200<br />
MHz) the post-layout simulation gave no problems and the design was<br />
then sent to the foundry.<br />
3.7 Tests performed on CARLOS v2
After receiving 20 samples of naked chips (without any package) from the Alcatel Mietec foundry, they were directly bonded onto the test PCB at the INFN Section of Torino, one sample per PCB. The test PCB, shown in Fig. 3.14, especially designed for testing CARLOS v2 and for its use in test-beam data taking, contains the following:
– 5 2×10-pin DIL connectors, pin-compatible with the pods of the HP16600/16700A pattern generator and logic analyzer;
– 2 Mictor 38 connectors;
– a DIP switch providing a facility to set up the hardwired parameters, such as the half-ladder ID;
– filter capacitors for a total capacitance greater than 100 nF;
– buffers for preserving the integrity of the CARLOS input pads.
After testing the JTAG control unit on CARLOS, the connection towards the SIMU was successfully tested: after the SIMU opens a transaction, CARLOS takes possession of the bidirectional buses and starts sending data. After these tests the SIMU was replaced by the SIU board and the complete data acquisition chain, i.e. the DIU (Destination Interface Unit) and the PCI RORC (Read Out Receiver Card) directly connected to a PC. In this way testing the behavior of CARLOS with large amounts of data becomes easier than with the Logic State Analyzer alone, and the complete data acquisition system can be used to acquire data in test beams.

Figure 3.14: CARLOS v2 test board
Chapter 4
2D compression algorithm and implementation
This chapter contains a brief description of the 2D algorithm [19], conceived at the INFN Section of Torino, and a first ASIC implementation attempt with the third prototype of CARLOS.
4.1 2D compression algorithm
The 2D algorithm performs a data reduction based on a two-threshold discrimination and a two-dimensional analysis along both the drift-time axis and the SDD anode axis. The proposed scheme allows for a better understanding of the neighbourhoods of the SDD signal clusters, thus improving their reconstructability, and also provides a statistical monitoring of the background features for each SDD anode.
4.1.1 Introduction
As shown in Chapter 3, due to the presence of noise a simple single-threshold one-dimensional zero suppression does not allow a good cluster reconstruction in all circumstances. Indeed, in order to obtain a
good compression factor using the 1D algorithm, a threshold of about three times the RMS of the noise has to be used. Such a threshold often causes a rather sharp cut of the tails of the anode signals containing high samples and, more importantly, it can completely suppress the small-amplitude anodic signals on the sides of the cluster. Both these sharp cuts, particularly the latter, can significantly affect the spatial resolution. Though samples below a 3 RMS threshold have a small information content, it is conceivable that, in the more accurate off-line analysis, they can help to improve the pattern recognition and the fitting of the cluster features. In order to read out small-amplitude samples without increasing too much the collected noise, a two-threshold algorithm can be used, so that small samples satisfying a low threshold are collected only when, along the drift direction, they are near samples satisfying a high threshold. Since the charge cloud diffuses in two orthogonal directions for symmetry reasons, and due to the previous considerations, the two-threshold method should be applied along the anode axis too. We want such a two-threshold, two-dimensional data compression and zero suppression algorithm to satisfy the following criteria:
– the values of the samples in the neighbourhood of a cluster be available, both for an accurate measurement of the characteristics of the clusters and for a good monitoring and understanding of the characteristics of the background;
– the statistical nature of the suppressed samples be available, in order to monitor the noise level of the anodes and to obtain their baseline values, which have to be subtracted from the cluster samples in order to obtain a correct measurement of the related charge.
Here follows a description of the studied algorithm. The data reduction algorithm is applied to a matrix of 256 rows by 256 columns like the one shown in the upper part of Fig. 4.1. Each matrix element is an 8-bit quantized amplitude.

Figure 4.1: Example of the digitized data produced by a half SDD

A row represents a time sequence of the samples from a single SDD anode and a column represents a spatial snapshot of the simultaneous anode outputs at a given instant of time. For each charge cloud we expect several high values in
one or more columns and rows. This extension in both time and space thus requires that correlations in both dimensions be preserved for future analysis. We refer to correlations within a column as space-like and to correlations within a row as time-like. Therefore, in the proposed two-threshold two-dimensional algorithm, the high threshold TH must be satisfied by a pixel value in order for it to be part of a cluster, while the low threshold TL leads to the registering of a pixel whose value satisfies it, if it is adjacent to another pixel satisfying TH. In this way the lower-value pixels on the border of a cluster are encoded, thus ensuring that the tails of the charge distribution are retrieved.

Figure 4.2: Neighbourhood of the pixel C
Within this framework, a cluster is operationally redefined as a set of adjacent pixels whose values tend to stand out above the background. In the described algorithm there is a trade-off in the definition of such a cluster, which lies in the definition of adjacency. We have considered as adjacent (or neighbouring) to the (i, j) element the pixels for which only one of the two indexes changes by 1: the neighbour pixels are thus (i − 1, j), (i + 1, j), (i, j − 1) and (i, j + 1). A correlation therefore involves a quintuple composed of a central (C) pixel and its north (N), south (S), east (E) and west (W) neighbours only (see Fig. 4.2). In order to monitor the statistical nature of the suppressed samples, the number of zero quantized values (due either to negative analog values of the noise or to baseline equalization) and the numbers of samples satisfying TH and TL are recorded. The background average and standard deviation are obtained by applying a minimization procedure to these three counts. An aspect of this reduction algorithm allows the conservation of information about the background both near and far from the clusters. When the thresholds are properly chosen, statistically, pairs and a few triplets of background pixels not associated with a particle-produced cluster will satisfy the described discrimination criteria and provide consistency information on the background statistics, assumed to be Gaussian white noise. At the same time, isolated high background peaks are suppressed (if they do not have at least one neighbour satisfying at least the low threshold), so as not to overload the data acquisition and to allow an efficient zero suppression. The only parameters needed as input to the 2D compression algorithm are the two thresholds TH and TL and the baseline equalization values.

Figure 4.3: Cluster in two dimensions and its slices along the anode direction
4.1.2 How the 2D algorithm works
The 2D algorithm makes use of two threshold values:
– a high threshold TH for cluster selection;
– a low threshold TL, so as to collect information around the selected cluster.
The algorithm retains data belonging to a cluster and around a cluster in the following way (shown graphically as an example in Fig. 4.3):
– the pixel matrix is scanned searching for values higher than the TH value (70 in Fig. 4.3);
– the pixels positioned around the previously selected ones are accepted if they are higher than the low threshold value TL (40 in Fig. 4.3), otherwise they are rejected;
– thus a cluster is defined and the cluster values are saved exactly as they are: other pixels, not belonging to clusters, are discarded;
– if a pixel value higher than TH is found but has no pixel values higher than TL around it, it is rejected. This is the case of the value 78 in the bottom-left corner of Fig. 4.3, which is discarded even though it is greater than the high threshold value;
– pixel values belonging to a cluster are encoded using a simple look-up table method, assigning long codes to infrequent values and short codes to frequent values.
Thus in Fig. 4.3, after applying the 2D compression algorithm, only the shadowed values are stored, while the other values are erased. The 2D algorithm is conceptually very simple to understand, but its hardware implementation is considerably more complex than that of the 1D algorithm. In fact, having to perform a two-dimensional analysis of the pixel array implies the need to store all the information in a digital buffer on CARLOS, thus requiring a larger silicon area and a higher cost.
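Under the adjacency definition of Section 4.1.1 (4-neighbourhood), the selection step of the 2D algorithm can be sketched as follows. The look-up-table encoding of the retained values is left out, and the samples are assumed to be already baseline-equalized:

```python
def compress_2d(matrix, th, tl):
    """Two-threshold 2D zero suppression (selection step only):
    a pixel above TH seeds a cluster, but is kept only if at least
    one of its 4-neighbours passes TL (isolated peaks are rejected);
    neighbours passing TL are registered too. All other pixels are
    zeroed out."""
    rows, cols = len(matrix), len(matrix[0])
    keep = [[False] * cols for _ in range(rows)]
    for i in range(rows):
        for j in range(cols):
            if matrix[i][j] >= th:
                neigh = [(i - 1, j), (i + 1, j), (i, j - 1), (i, j + 1)]
                mates = [(a, b) for a, b in neigh
                         if 0 <= a < rows and 0 <= b < cols
                         and matrix[a][b] >= tl]
                if mates:               # at least one neighbour above TL
                    keep[i][j] = True   # seed pixel retained
                    for a, b in mates:
                        keep[a][b] = True  # border pixels retained too
    return [[matrix[i][j] if keep[i][j] else 0 for j in range(cols)]
            for i in range(rows)]
```

With TH = 70 and TL = 40 as in Fig. 4.3, a pixel of value 78 with no neighbour above 40 is zeroed, while an 80 with a 50 next to it keeps both values, reproducing the behaviour described in the list above.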
4.1.3 Compression coefficient
Fig. 4.4 shows the 2D compression coefficient as a function of the high threshold value, calculated using data coming from the test beam of September 1998. The 2D compression algorithm reaches a compression ratio of 22 choosing a TH value of 1.5 noise RMS and a TL of 1.2 noise RMS. It should be remembered that the 1D compression algorithm had to use a threshold level of 3 noise RMS in order to reach the target compression ratio. The 2D algorithm thus performs better than the 1D, since it reaches the target compression ratio while losing a smaller amount of physical information. This is the main reason why the 2D algorithm has been chosen as the one to be implemented in the final version of CARLOS.
Figure 4.4: 2D compression coefficient as a function of the high threshold
4.1.4 Reconstruction error
Also as regards the reconstruction error, the 2D algorithm performs better than the 1D. In fact the differences between the cluster centroid positions before and after compression are fitted by a Gaussian distribution centered around 0, with a σ of 10 µm along the drift-time direction and 10 µm along the anode direction, choosing 1.5 noise RMS for TH and 1.2 noise RMS for TL. The 2D algorithm thus achieves a better cluster-center resolution than the 1D by keeping track of more pixel values around the cluster center. Moreover, the 2D algorithm introduces a smaller bias on the reconstructed charge than the 1D, around 3 %, meaning that the reconstructed cluster charge is 3 % lower than before the compression-decompression steps.
Besides that, the 2D algorithm is very useful for studying the noise distribution: monitoring the pairs of noise samples passing the double-threshold filter makes it possible to recover information on the average and on the standard deviation of the Gaussian noise distribution. This is quite important for checking how the signal-to-background ratio changes in time.
If used in lossless mode, the 2D algorithm yields a compression ratio of 1.3, versus the value of 2.3 obtained with the lossless version of the 1D algorithm: this requires a more complex second-level compressor in the counting room in order to reach the target compression ratio of 22, in case the 2D compression algorithm cannot be applied to the data. In fact there are some cases in which the use of the 2D compression algorithm might no longer be desirable: for example, when the baseline value is not constant over the 256 samples of an anode row. This is the case of the present version of the PASCAL chip, which introduces a slope in each anode-row baseline and, what is worse, the slope value varies from row to row. It is obvious that a fixed double-threshold compressor, such as the one described in this chapter, cannot deal with this problem. The foreseen solution is therefore to eliminate the baseline slope in the final version of PASCAL. If this proves not to be possible, or if a sloped baseline behavior emerges after some working time, the use of the 2D algorithm can no longer be accepted. In this case data compression on CARLOS has to be switched off, and a second-level compression algorithm implemented directly in the counting room will do the job.
4.2 CARLOS v3 vs. the previous prototypes

There are several differences between CARLOS v3 and the previous versions. Here is a brief list of the most important ones:
1. CARLOS v1 and v2 were meant to work in a radiation-free environment, since, when they were designed, the problem of radiation had not yet been faced. Therefore commercial technologies, such as Xilinx FPGAs and the Alcatel Mietec design kit, were chosen for prototype implementation. The necessity for CARLOS to work in a radiation environment emerged some time after CARLOS v2 was sent to the foundry. The radiation level CARLOS has to withstand is in the range from 5 to 15 krad. This led us to search for a radiation-safe technology.
One possible solution is SOI (Silicon On Insulator) technology, which provides complete radiation resistance. This is the case, for instance, of the 0.8 µm DMILL technology, which is widely used even in satellite applications at ESA (the European Space Agency). The problem with this technology is mainly one: its cost is too high for our budget. We therefore decided to choose a commercial technology, IBM 0.25 µm, with a library of standard cells designed to be radiation tolerant up to a few Mrad. The library has been designed by the EP-MIC group at CERN.
2. Mechanical constraints emerged that do not allow the use of the SIU in the end-ladder zone, since it is far too big for the available space. Another problem concerning the SIU is that this device cannot safely work in a radiation environment, since it contains commercial devices such as ALTERA PLDs. Finally, the laser driver hosted on the SIU board has a mean life of a few years, while we are looking for something lasting until the end of the experiment's data taking.
These considerations led us to change the whole readout architecture from CARLOS to the DAQ. Instead of interfacing directly with the SIU, CARLOS v3 interfaces with the radiation-tolerant serializer GOL chip (Gigabit Optical Link) [20]. Serial data are then sent to the counting room over a 200 m long optical fibre, deserialized by a commercial deserializer device and then sent to the SIU board through an FPGA device named CARLOS-rx that has still to be designed. This final readout architecture is shown in detail in Fig. 4.5.
3. CARLOS v3 contains only 2 data processing channels, versus the 8 hosted in the two previous prototypes. This choice was due to the need to reduce the ASIC complexity and to greatly reduce the possibility of losing data in case of chip failure. In fact, if a CARLOS v2 chip breaks down for some reason, the data coming from a half-ladder, i.e. from 4 detectors, are completely lost until the chip is replaced with a working one. If, instead, a CARLOS v3 chip breaks down, only the data coming from one SDD detector are lost. Hence a 2-channel version of CARLOS provides greater resistance to failures and is far less complex.
4. CARLOS v3 contains a preliminary interface with the TTCrx chip, which distributes the trigger signals and the clock to the end-ladder board.
5. CARLOS v3 also contains a BIST (Built In Self Test) structure for a quick test of the chip itself, issued via the JTAG port.
Figure 4.5: The final readout chain
4.3 The final readout architecture

The architecture chosen for the final readout system introduces new tasks to carry out and new problems to solve.
For instance, splitting CARLOS into 4 chips makes every chip much simpler to design, test and control (CARLOS v2 is a very complex and difficult-to-debug chip), but moving the SIU board to the counting room implies designing the CARLOS-rx device, which takes data from 4 deserializer chips and feeds them to the SIU.
Besides that, putting a 200 m distance between CARLOS and the SIU implies that no back-pressure can be used: in fact, if the SIU asserts the filf-n signal, meaning that it cannot accept further data starting from the following foclk cycle, CARLOS receives this information after 2 µs, i.e. after 40 foclk cycles. Hence the CARLOS-rx chip has to contain a well-sized FIFO buffer to store data when the SIU is not able to accept them.
The role of the JTAG link is shown in Fig. 4.6. In the new architecture a transaction can be opened and closed via the JTAG link, instead of using the 32-bit bus fbd. The JTAG link is obtained by serializing the 5-bit JTAG port coming from the SIU for transmission to the front-end zone through an optical fibre; the HAL (Hardware Abstraction Layer) chip then performs the serial-to-parallel conversion, distributing the JTAG signals to the PASCAL, AMBRA and CARLOS chips. A rad-hard version of the HAL chip has yet to be implemented.
Currently we plan to use a commercial pair of serializer-deserializer chips from Agilent Technologies; in the final architecture the serializer chip will be replaced with the rad-hard Gigabit Optical Link (GOL) chip designed by the Marchioro group at CERN. This chip is a multi-protocol high-speed transmitter ASIC which is able to withstand high doses of radiation. The IC supports two standard protocols, G-Link and GBit-Ethernet, and sustains data transmission at both 800 Mbit/s and 1.6 Gbit/s. The ASIC was implemented in the CERN 0.25 µm CMOS library, employing radiation-tolerant layout techniques.

Figure 4.6: Final readout chain zoom
One problem concerning the use of the GOL chip has yet to be solved: the TTCrx chip distributes to all front-end chips a clock with a maximum jitter of around 300 ps. This is not a problem for the AMBRA and CARLOS ICs working at 40 MHz, but it proves to be a big problem for the GOL chip, since it contains an internal PLL that multiplies the incoming 40 MHz clock by 20 or 40, so as to get an internal 800 MHz or 1.6 GHz frequency. The PLL shows synchronization problems with the incoming clock if the input jitter is greater than 100 ps. This problem has still to be faced and solved.
4.4 CARLOS v3

CARLOS v3 is our first prototype tailored to fit the new readout architecture. The main new features of this chip are:
– two processing channels;
– the radiation-tolerant technology chosen.
Nevertheless, CARLOS v3 does not contain the complete 2D compression algorithm, as might be expected. We made this choice in order to gain experience with a small chip in the new technology and with the new layout techniques, since we had to carry out the layout design ourselves. Taking into account that the CERN 0.25 µm library contains a small number of standard cells, and that they are not as well characterized as commercial ones, we decided to try the new design flow and new technology on a simple chip: the result is CARLOS v3, which was sent to the foundry in November 2001 and will be tested starting from February 2002.
As a compression block, CARLOS v3 hosts only the simple encoding scheme conceived as the final part of the 2D algorithm. Nevertheless, if CARLOS v3 proves to work perfectly, it will be used to acquire data in the test beams and will allow us to build and test the foreseen readout architecture.
4.5 CARLOS v3 building blocks

Fig. 4.7 shows the main building blocks of CARLOS v3. The complete design of CARLOS v3 has been carried out in Bologna: I have worked on the VHDL models, while other people worked on the C++ models of the same blocks. Each block has been designed both in VHDL and in C++, so as to allow an easy verification and debugging process.
The two main processing channels are the ones containing the encoderbo, barrel15, fifonew32x15 and outmux blocks: these blocks take data coming from the AMBRA chips, encode them using a lossless compression algorithm, pack them into 15-bit words and store them in a FIFO memory before sending them in output to the GOL chip, one channel after the other.
Figure 4.7: CARLOS v3 building blocks
The channel containing the ttc-rx-interface and fifo-trigger15x12 blocks receives the trigger numbers (bunch counter and event counter) from the TTCrx chip and sends them in output at the beginning of each data packet. The event-counter block is a local event-number generator providing further information to be added to the event number coming from the TTCrx chip: this gives us greater confidence in being able to reconstruct the data and to find errors if present. A trigger-interface block then handles the trigger signals L0, L1 and L2 coming from the Central Trigger Processor (CTP) through the TTCrx chip. A Command Mode Control Unit (CMCU) receives commands issued through the JTAG port and puts CARLOS into one of several logic states: running, idle, bist and so on. Finally, the BIST blocks on chip are based on a pseudo-random pattern generator and a signature-maker circuit. The next paragraphs contain a detailed description of these blocks.
4.5.1 The channel block

The channel block is the main processing unit contained in CARLOS for data encoding, packing and storing. It is composed of three blocks: encoderbo, barrel15 and fifonew32x15. Two identical channel blocks are hosted on CARLOS v3.
4.5.2 The encoder block

The I/O signals are:

– value: input 8-bit bus;
– value-strobe: input signal;
– ck: input signal;
– reset: input signal;
– data: output 10-bit bus;
– field: output 4-bit bus;
– valid: output signal.

Input range   Output code        Total
0-1           1 bit + 000        4 bits
2-3           1 LSB bit + 001    4 bits
4-7           2 LSB bits + 010   5 bits
8-15          3 LSB bits + 011   6 bits
16-31         4 LSB bits + 100   7 bits
32-63         5 LSB bits + 101   8 bits
64-127        6 LSB bits + 110   9 bits
128-255       7 LSB bits + 111   10 bits

Table 4.1: Lossless compression algorithm encoding scheme
The encoderbo block encodes 8-bit input data into variable-length codes, from 4 to 10 bits long, in a completely lossless way. Table 4.1 contains a detailed description of the encoding mechanism. This encoding scheme compresses the input data based on knowledge of the statistics of the stream: in fact, small values are much more probable than high ones. Hence most input data will be reduced from 8 to 4 or 5 bits, providing some degree of compression. It is nevertheless possible that locally, in time, this compressor produces an expansion of the data: if a long sequence of values greater than 127 occurs, the encoderbo block outputs a stream of 10-bit codes, which have to be temporarily stored in a FIFO buffer. Here is a description of how the block actually works: when the input signal value-strobe is high, the 8-bit input value is encoded into the 10-bit output data and the valid output signal is asserted. The field output signal is assigned the number of bits actually containing information in the 10-bit data register. The block is synchronous with the rising edge of the clock, while the reset signal is active high and asynchronous.
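The encoding rule of Table 4.1 can be sketched in a few lines of C++, in the spirit of the C++ reference models used to verify the VHDL blocks (the function names here are ours, not those of the actual models):

```cpp
#include <cassert>
#include <cstdint>
#include <utility>

// One 8-bit sample -> (code, length in bits) following Table 4.1.
// The 3-bit range field sits in the code LSBs and the kept sample
// LSBs sit above it, matching the reconstruction rule in the text.
std::pair<uint16_t, int> encode_sample(uint8_t v) {
    int idx = 0;                             // range index = 3-bit field value
    for (int t = v; t > 1; t >>= 1) ++idx;   // bit_length(v) - 1, 0 for v < 2
    int kept = (idx == 0) ? 1 : idx;         // LSBs kept (MSB of v is implied)
    uint16_t lsbs = v & ((1u << kept) - 1);
    return { uint16_t((lsbs << 3) | idx), kept + 3 };  // 4..10 bits total
}

// Inverse mapping, used here only to check losslessness.
uint8_t decode_sample(uint16_t code, int len) {
    int idx  = code & 7;
    int kept = len - 3;
    uint16_t lsbs = (code >> 3) & ((1u << kept) - 1);
    return (idx == 0) ? uint8_t(lsbs) : uint8_t((1u << idx) | lsbs);
}
```

Note how values up to 3 cost only 4 bits: with mostly-small samples this is where the compression comes from, while a run of values above 127 expands 8-bit samples into 10-bit codes, as discussed above.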
Figure 4.8: Graphical description of how the barrel shifter works
4.5.3 The barrel15 block

The I/O signals are:

– input: input 8-bit bus;
– sel: input 4-bit bus;
– load: input signal;
– ck: input signal;
– reset: input signal;
– end-trace: input signal;
– output-push: output signal;
– output: output 15-bit bus.
The barrel15 is the block that packs the 4- to 10-bit variable-length codes coming from the encoderbo block into fixed-length 15-bit words. Data are packed as shown in Figure 4.8. The barrel block makes use of two internal 15-bit registers, so as to be able to break an input code in two pieces without losing any information: while the first word is put in output by driving the output signal output-push low, the second word is used to store the input data. The latency of the barrel block is 2 clock periods: it takes 2 clock periods before a word is packed by the barrel15 block. When the input signal end-trace is asserted, meaning that this is the last data word belonging to the current event, the current value in the internal register is put in output even if it is not completely full: undefined bits are set to 0.
Data coming from the barrel can easily be reconstructed by starting from the 3 LSBs of the first barrel word, which contain the information on how many bits have to be selected on the left side of the code. Going on in this way, from the LSB to the MSB of every valid word, it is possible to retrieve all the encoded information.
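The packing rule can be captured in a small C++ model (our own sketch, not the production VHDL; LSB-first packing, as implied by the reconstruction rule above):

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Packs variable-length codes (4..10 bits) LSB-first into 15-bit words,
// mimicking barrel15: a code may straddle a word boundary, and end-trace
// flushes the last, partially filled word with zero padding.
class Barrel15 {
    uint32_t acc_ = 0;      // plays the role of the two internal 15-bit registers
    int nbits_ = 0;         // number of pending bits in acc_
    std::vector<uint16_t> words_;
public:
    void push(uint16_t code, int len) {
        acc_ |= uint32_t(code) << nbits_;   // append the new code above pending bits
        nbits_ += len;
        if (nbits_ >= 15) {                 // a full 15-bit word is ready
            words_.push_back(acc_ & 0x7FFF);
            acc_ >>= 15;
            nbits_ -= 15;
        }
    }
    void end_trace() {                      // last data of the event: flush
        if (nbits_ > 0) words_.push_back(acc_ & 0x7FFF);
        acc_ = 0;
        nbits_ = 0;
    }
    const std::vector<uint16_t>& words() const { return words_; }
};
```

Three 10-bit codes, for instance, fill exactly two 15-bit words, with the second code split across the word boundary.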
4.5.4 The fifonew32x15 block

The I/O signals are:

– push-req-n: input signal;
– pop-req-n: input signal;
– diag-n: input signal;
– data-in: input 15-bit bus;
– ck: input signal;
– reset: input signal;
– empty: output signal;
– almost-empty: output signal;
– half-full: output signal;
– almost-full: output signal;
– full: output signal;
– error: output signal;
– dataout: output 15-bit bus.
The fifonew32x15 block has the purpose of storing the information coming out of the barrel shifter. The multiplexing scheme that has been chosen cannot avoid the use of buffers before the multiplexer: since the output bandwidth is shared fairly, 50% of the time, between the two channels (one clock period for channel 0, the next for channel 1, and so on), and since the encoding algorithm can locally, in time, behave as an expander, data have to be stored locally before multiplexing.
The only decision to be taken concerns the FIFO dimensions: we have chosen a FIFO containing 32 words coming from the barrel shifter (32x15 bits), in order to accommodate the worst possible input data stream. The problem we faced when designing the FIFO block is the following: a FIFO is usually composed of a dual-port RAM block plus some logic implementing the First In First Out policy. This is, for example, what was done in CARLOS v2. However, the CERN 0.25 µm library only provides one size of RAM memory, namely 64x32 bits. This block is at least 4 times bigger than what we need (2048 bits versus 480). Besides that, it is quite difficult, if not impossible, to share the same RAM block between two different FIFO designs: the idea of sharing the FIFOs of the two channels is quite difficult to implement, since the number of read/write ports would have to be doubled. Therefore we decided to design a flip-flop based RAM for the FIFO, taken from the "DesignWare Foundation" library provided together with our design software, Synopsys. This is a library of IP (Intellectual Property) blocks ready to be inserted into a design, such as logic and arithmetic blocks, RAMs and application-specific blocks, for instance for error checking and correction or for a JTAG controller. The idea is that it is completely useless for every ASIC designer to lose time designing a block that hundreds of other designers all over the world also need. With this idea in mind, many IP libraries have been collected, such as the one provided by Synopsys that we have been making use of.
This is the behavior of the fifonew32x15 block: a push is executed when the push-req-n input is asserted (low) and either the full flag is inactive (low), or the full flag is active and the pop-req-n input is asserted (low). Hence a push can occur even if the FIFO is full, as long as a pop is executed in the same clock cycle. Asserting push-req-n in either of the above cases causes the data at the data-in port to be written to the next available location in the FIFO. A pop operation occurs when pop-req-n is asserted (low), as long as the FIFO is not empty. Asserting pop-req-n causes the internal read pointer to be incremented on the next rising edge of ck. Thus the RAM read data must be captured on the ck edge following the assertion of pop-req-n. Push and pop can occur at the same time if there are data in the FIFO, even when the FIFO is full. In this case, first the pop data are captured by the next stage of logic after the FIFO, and then the new data are pushed into the same location from which the data were popped. Hence there is no conflict in a simultaneous push and pop when the FIFO is full. A simultaneous push and pop cannot occur when the FIFO is empty, since there are no pop data to prefetch.
The FIFO block provides some important flags, such as empty, almost-full and full. The empty flag indicates that there are no words in the FIFO available to be popped. The almost-full flag is asserted when there are no more than 8 empty locations left in the FIFO. This number is used as a threshold and is very useful for preventing the FIFO from overflowing. When this flag is asserted, the data-stop signal, output from CARLOS, is sent to the AMBRA chip, asking it to stop the data stream transmission. AMBRA requires 3 clock cycles before it actually stops sending data to CARLOS. Hence the threshold level of 8 chosen for the FIFO design has to account for this 3-clock-period delay due to AMBRA and for the latency due to the encoder and barrel blocks. This flag is thus very useful for managing the data transmission between AMBRA and CARLOS without losing any data. The last flag, full, indicates that the FIFO is full and there is no space available for pushing data. If the AMBRA - CARLOS communication works well, this flag should never be asserted. Fig. 4.9 shows the FIFO timing waveforms during the push phase, while Fig. 4.10 shows the FIFO timing waveforms during the pop phase.
Figure 4.9: FIFO timing waveforms during the push phase
Figure 4.10: FIFO timing waveforms during the pop phase
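The push/pop and flag rules described above can be summarized in a short behavioral C++ model (a sketch with active-high controls for readability; the real ports are the active-low push-req-n and pop-req-n signals):

```cpp
#include <cassert>
#include <cstdint>

// Behavioral model of the 32x15 FIFO rules in the text: the pop is
// served first, so a push into a full FIFO is legal only if a pop
// happens in the same cycle; almost-full trips with 8 locations free.
class Fifo32x15 {
    uint16_t mem_[32] = {};
    unsigned rd_ = 0, wr_ = 0, count_ = 0;
public:
    bool empty() const       { return count_ == 0; }
    bool full()  const       { return count_ == 32; }
    bool almost_full() const { return count_ >= 24; }  // <= 8 free locations
    // One clock edge; returns the word read when a pop is accepted.
    uint16_t cycle(bool push, bool pop, uint16_t data_in) {
        uint16_t out = mem_[rd_ % 32];
        bool do_pop  = pop && !empty();
        bool do_push = push && (!full() || do_pop);    // full: needs same-cycle pop
        if (do_pop)  { ++rd_; --count_; }
        if (do_push) { mem_[wr_ % 32] = data_in; ++wr_; ++count_; }
        return out;
    }
};
```

The almost_full margin of 8 mirrors the data-stop handshake with AMBRA: the 3-cycle stop delay plus the encoder and barrel latency must fit inside the remaining free locations.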
4.5.5 The channel-trigger block

The channel-trigger block has the purpose of getting the trigger numbers from the TTCrx chip and storing them before they are multiplexed and sent to the GOL chip. It is composed of two different blocks: the ttc-rx-interface and the fifo-trigger block.
4.5.6 The ttc-rx-interface block

The I/O signals are:

– TTCready: input signal;
– BCnt: input 12-bit bus;
– BCntLStr: input signal;
– EvCntLStr: input signal;
– EvCntHStr: input signal;
– ck: input signal;
– reset: input signal;
– BCnt-reg: output 12-bit bus;
– EvCntL-reg: output 12-bit bus;
– EvCntH-reg: output 12-bit bus.
The ttc-rx-interface block receives trigger information from the TTCrx chip when the input signal TTCready coming from the TTCrx chip is high, meaning that the TTCrx is ready. When BCntStr is high, the 12-bit input word is fetched into the register BCnt-reg; the same happens with EvCntLStr and EvCntHStr for the LSBs and MSBs of the 24-bit event counter word. Following an active L2accept signal, the values of these three registers are written into 3 memory locations of the fifo-trigger block. Since the event can be discarded until the final confirmation arrives through the L2accept signal, it is necessary to wait for this signal before storing them in the FIFO.
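The latch-on-strobe, commit-on-L2accept rule can be sketched in C++ (a behavioral summary of the description above; the struct and a plain vector standing in for the fifo-trigger block are our own simplifications):

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Strobes capture the counters into the *-reg registers; only L2accept
// commits the triplet to the trigger FIFO, since an event may still be
// rejected before L2accept arrives.
struct TtcRxInterface {
    uint16_t bcnt_reg = 0, evl_reg = 0, evh_reg = 0;
    std::vector<uint16_t> fifo_trigger;   // stand-in for the fifo-trigger block
    void clock(bool ttc_ready, bool bcnt_str, bool evl_str, bool evh_str,
               bool l2accept, uint16_t bus) {
        if (!ttc_ready) return;           // TTCrx not ready: ignore inputs
        if (bcnt_str) bcnt_reg = bus & 0x0FFF;
        if (evl_str)  evl_reg  = bus & 0x0FFF;
        if (evh_str)  evh_reg  = bus & 0x0FFF;
        if (l2accept) {                   // commit the triplet on L2accept
            fifo_trigger.push_back(bcnt_reg);
            fifo_trigger.push_back(evh_reg);
            fifo_trigger.push_back(evl_reg);
        }
    }
};
```

The commit order (bunch counter, then event counter MSBs, then LSBs) matches the packet layout produced by the outmux block described later.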
4.5.7 The fifo-trigger block

This block is logically equivalent to the FIFO block except for its dimensions: its size is 15x12 bits. During the transmission of a complete event from AMBRA to CARLOS, which lasts 1.6 ms, up to four events can be stored in the AMBRA chip, so CARLOS has to process 4 triplets of incoming signals L0, L1accept and L2accept. Thus a FIFO 15 words deep is necessary for storing the bunch counter and event counter information of 5 consecutive accepted events. When CARLOS is ready to send a data packet in output, the first 3 trigger words are read and taken to the outmux block. In this way a correct synchronization between the data being sent and the trigger information is preserved. The output flags of the fifo-trigger block (empty, almost-full and full) are not used by other blocks as a control, since we do not expect a buffer overflow, thanks to the structure of the AMBRA chip.
4.5.8 The event-counter block

The I/O signals are:

– end-trace: input signal;
– ck: input signal;
– reset: input signal;
– event-id: output 3-bit bus.

A local event count is maintained on CARLOS by the event-counter block. It is a very simple 3-bit counter triggered by the event-ident signal coming from the outmux block: this signal asserts that an event has been completely transmitted and a new one can be accepted. This number is used both in the header and in the footer words for a safer transmission protocol.
4.5.9 The outmux block

The I/O signals are:

– indat1: input 15-bit bus;
– indat0: input 15-bit bus;
– trigger-data: input 12-bit bus;
– reset: input signal;
– ck: input signal;
– gol-ready: input signal;
– fifo-empty: input 2-bit bus;
– half-ladder-id: input 7-bit bus;
– all-fifos-empty: input signal;
– event-id: input 3-bit bus;
– no-input-data: input signal;
– event-identifier: output signal;
– read-data: output 2-bit bus;
– read-trigger: output signal;
– output-strobe: output signal;
– output: output 16-bit bus.
The outmux block is a multiplexing unit that sends in output the data coming from the two main processing channels in an interlaced way: during even clock periods data coming from channel 1 are put in output, while during odd clock periods data coming from channel 0 are served.
This is how the outmux block behaves: as soon as data begin to fill the two FIFO blocks, the outmux block starts putting in output a packet like the one shown in Fig. 4.11. The first three 16-bit words contain the trigger information coming from the trigger channel: the first word contains the bunch counter, while the second and third words contain the event counter MSBs and LSBs respectively. Since the trigger information is 12 bits long, the bits 1011 are added as MSBs, in order to be able to recognize these words easily in a later phase of data reconstruction.
Then follow two header words containing the local event-id number and the externally hardwired half-ladder-id information. The MSBs of the header words are 110.
The headers are followed by an even number of data words containing data from the two main channels: if a channel has no valid data to send, the MSB is set to 1 and all the other bits are set to 0, meaning that a dummy word is sent in output; otherwise the MSB is set to 0, meaning that the data word is valid.
The data packet is then concluded with the transmission of two footer words containing the same information as the header regarding the event-id number, plus the number of words sent in output. The MSBs are all set to 1, so as to uniquely identify the footer word type.

Figure 4.11: CARLOS v3 data transmission protocol
The outmux block puts in output the 16-bit data words and the signal output-strobe. When this signal is high, CARLOS is transmitting data belonging to a packet; when it is low, CARLOS is not sending useful information to the GOL chip. When the gol-ready signal coming from the GOL chip goes low, meaning that it has lost synchronization with the input clock, CARLOS stops sending data and resumes transmission only when gol-ready goes high again. The outmux block also puts in output the 2-bit signal read-data, which is sent to the 2 main FIFOs as a pop signal, and the signal read-trigger, sent to the fifo-trigger block. The outmux block also asserts the signal event-ident, which is used as a trigger for the event-counter block. The input signal all-fifos-empty puts an end to the data packet transmission once the end of an event has been reached: after the occurrence of high values on the input signals data-end1 and data-end0, CARLOS waits until both FIFOs are empty in order to assert the all-fifos-empty signal. This triggers the end of the event transmission.
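The word-type prefixes of the packet format can be summarized in a small C++ decoder, of the kind the receiving side would need for data reconstruction (a sketch of the protocol as described in the text, not code from the actual CARLOS-rx; the prefix values for footer and dummy words are inferred from the description above):

```cpp
#include <cassert>
#include <cstdint>

// Classify a 16-bit CARLOS v3 output word by its MSB prefix:
// 1011 -> trigger word, 110 -> header, 111 -> footer (all MSBs at 1),
// leading 0 -> valid data, 1 followed by zeros -> dummy data word.
enum class WordType { Trigger, Header, Footer, Data, Dummy, Unknown };

WordType classify(uint16_t w) {
    if ((w >> 12) == 0xB)  return WordType::Trigger;  // 1011 ....
    if ((w >> 13) == 0x6)  return WordType::Header;   // 110. ....
    if ((w >> 13) == 0x7)  return WordType::Footer;   // 111. ....
    if ((w & 0x8000) == 0) return WordType::Data;     // MSB 0: valid data
    if (w == 0x8000)       return WordType::Dummy;    // MSB 1, all other bits 0
    return WordType::Unknown;
}
```

The prefixes are chosen so that no valid data word (MSB 0) can be confused with trigger, header or footer words, which all start with 1.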
4.5.10 The trigger-interface block

The I/O signals are:

– reference-count-trigger: input 8-bit bus;
– L0: input signal;
– L1accept: input signal;
– L2accept: input signal;
– L2reject: input signal;
– dis-trigger: input signal;
– ck: input signal;
– reset: input signal;
– busy: output signal;
– trigger: output signal;
– abort: output signal.
This block accepts as inputs the trigger signals L0, L1accept, L2accept and L2reject. Here is a brief description of how these signals are used to accept or reject an event for storage: the L0 signal is asserted 1.2 µs after the interaction; the L1accept signal is asserted 5.5 µs after the interaction, and if it is not asserted in time the event is rejected; L2accept is asserted 100 µs after the interaction if the event is accepted, otherwise an L2reject signal is asserted before 100 µs. This means that either an L2accept or an L2reject signal is always asserted.
The trigger-interface block receives these inputs and processes them to build 3 other signals: trigger, busy and abort. The trigger signal is L0 delayed by a number of clock cycles programmable via JTAG, and is distributed to the PASCAL and AMBRA chips. This is the signal triggering an event data acquisition on the PASCAL chip.
The busy signal is asserted just after L0 and then stays active until 5.5 µs after the interaction. If the signal L1accept is not asserted, busy goes low again; otherwise it stays active until the signal dis-trigger coming from AMBRA is activated. The meaning is the following: while PASCAL is transferring data to AMBRA, the readout system is not ready to accept any other trigger signal, that is, to acquire any other data. The time necessary for the transmission of an event from PASCAL to AMBRA is about 360 µs. Finally, the abort signal that CARLOS sends to AMBRA is asserted when the L1accept signal is not asserted at the prefixed time or when the L2reject signal is asserted. The abort signal causes the data transmission from PASCAL to AMBRA to end, and the data already stored are discarded.
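The accept/reject sequence above amounts to a simple decision rule, sketched here in C++ (a behavioral summary with names of our own choosing, not the synthesized trigger logic; times are measured from the interaction):

```cpp
#include <cassert>

// Outcome of one trigger sequence, following the timing in the text:
// L0 at 1.2 us, L1accept expected at 5.5 us, L2accept/L2reject by 100 us.
enum class Outcome { Accepted, RejectedNoL1, Aborted };

Outcome trigger_sequence(bool l1accept_in_time, bool l2accept) {
    if (!l1accept_in_time) return Outcome::RejectedNoL1;  // busy drops again
    if (!l2accept)         return Outcome::Aborted;       // L2reject: abort to AMBRA
    return Outcome::Accepted;                             // event kept
}
```

In the first two cases the abort signal discards the data already transferred from PASCAL to AMBRA; only in the third case is the event committed for readout.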
4.5.11 The cmcu block

The I/O signals are:

– tdi: input signal;
– tms: input signal;
– trst: input signal;
– tck: input signal;
Figure 4.12: CMCU logic state diagram
– bist-ok-tcked: input signal;
– bist-failure-tcked: input signal;
– ck: input signal;
– reset: input signal;
– reference-count-trigger: output 8-bit bus;
– tdo: output signal;
– state-tcked: output signal;
– reset-pipe: output signal.
The Command Mode Control Unit (cmcu) is the CARLOS internal control unit, remotely controlled via the JTAG port. Serial data coming from the JTAG pin tdi are packed into 8-bit words and interpreted as a very simple program containing commands and operands. Fig. 4.12 shows the CARLOS working states reachable using the JTAG port.
At power-on CARLOS is put in an IDLE state in which no calculation is performed. Then it can be put in a RESET-PIPELINE state in which an internal reset signal is asserted and all registers are initialized. The
following state is the BIST (Built-In Self Test) state, in which CARLOS runs an internal test at working speed to check whether everything is working correctly; depending on the test results, CARLOS then enters the BIST-FAILURE state or the BIST-SUCCESS state. In case of success the 8-bit word sent serially as output on tdo is A0, otherwise the word is 55. In the WRITE-REG state CARLOS prepares to write an internal register with the value read via JTAG in the next state, WRITE-REG-FETCH: this register contains the number of clock cycles of delay to be applied to the incoming L0 signal before passing it to the AMBRA chip. If needed, during the READ-REG stage the CARLOS user can read this value back through the tdo output JTAG pin, to check that no errors occurred during the writing phase. Then CARLOS can finally enter the RUNNING stage, in which it is able to accept and process input data streams and to manage the interfaces towards the GOL and TTCrx chips. When CARLOS is not in RUNNING mode the busy signal is set high, meaning that no L0 trigger signal is accepted from the CTP and no data is transmitted to the GOL chip.
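The working states of Fig. 4.12 can be modelled as a small state machine. The state names come from the text, but the transition table below is an assumption made for illustration; the real cmcu transitions are defined by the JTAG command program.

```python
# Sketch of the CARLOS working states reachable via JTAG (cf. Fig. 4.12).
# State names are taken from the text; the transition table is an
# illustrative assumption, not the actual cmcu implementation.

TRANSITIONS = {
    "IDLE":            ["RESET-PIPELINE"],
    "RESET-PIPELINE":  ["BIST", "WRITE-REG", "RUNNING"],
    "BIST":            ["BIST-SUCCESS", "BIST-FAILURE"],
    "BIST-SUCCESS":    ["IDLE"],   # tdo serially outputs A0
    "BIST-FAILURE":    ["IDLE"],   # tdo serially outputs 55
    "WRITE-REG":       ["WRITE-REG-FETCH"],
    "WRITE-REG-FETCH": ["READ-REG", "RUNNING"],
    "READ-REG":        ["RUNNING"],
    "RUNNING":         ["IDLE"],
}

def is_legal(path):
    """Check that a sequence of states follows the assumed transition table."""
    return all(b in TRANSITIONS.get(a, []) for a, b in zip(path, path[1:]))

def busy(state):
    """When CARLOS is not RUNNING the busy signal is high: no L0 is accepted."""
    return state != "RUNNING"
```

For example, IDLE, RESET-PIPELINE, BIST, BIST-SUCCESS is a legal path in this model, and busy stays high along the whole of it.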
4.5.12 The pattern-generator block

The I/O signals are:

– bist-start: input signal;
– ck: input signal;
– reset: input signal;
– data: output 8-bit bus;
– data-valid: output signal;
– data-end: output signal.
The pattern generator block is part of the BIST utility implemented on CARLOS v3. The BIST [21, 22] is an in-circuit testing scheme for digital circuits in which both test generation and test verification are done by circuitry built into the chip itself. BIST schemes offer three attractive advantages:

1. they offer a solution to the problem of testing large integrated circuits with a limited number of I/O pins;
2. they are useful for high-speed testing since they can run at design speed;
3. they do not require expensive external automatic test equipment (ATE).
BIST schemes, in the most general sense, can have any of the following characteristics:

– concurrent or non-concurrent operation: concurrent testing is designed to detect faults during normal circuit operation, while non-concurrent testing requires that normal operation be suspended during testing. In CARLOS v3 non-concurrent operation has been chosen, since we decided to use BIST only to check the correct behavior of the chip off-line.
– exhaustive or non-exhaustive test design: an exhaustive test of a circuit requires that every intended state of the circuit be shown to exist and that all transitions be demonstrated. For large sequential circuits such as CARLOS this is not practical, so we decided to implement a non-exhaustive testing design.
– deterministic or pseudo-random generation of test vectors: deterministic testing occurs when specific precomputed vectors have to be applied, while pseudo-random testing occurs when random-like test vectors are produced. We chose pseudo-random generation since its implementation requires much less area than deterministic generation. Pseudo-random generation on CARLOS v3 is performed by the pattern generator block.
The pattern generator block provides a set of 200 pseudo-random test vectors for BIST. These vectors are provided at the same time to both processing channels. The pseudo-random sequence is obtained using a linear feedback shift register, which is a very simple structure and requires a very small on-chip area.
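The LFSR mechanism can be sketched in a few lines. The text only states that 200 pseudo-random vectors are produced; the register width, feedback polynomial and seed below are illustrative assumptions (the polynomial x^8 + x^4 + x^3 + x^2 + 1 is a standard primitive choice for 8 bits, giving a maximal period of 255).

```python
# Minimal sketch of a pseudo-random pattern generator based on a Galois-style
# linear feedback shift register (LFSR). Width, polynomial and seed are
# illustrative assumptions, not the CARLOS pattern-generator parameters.

def lfsr_vectors(n_vectors=200, seed=0x01, poly=0x1D, width=8):
    """Generate n_vectors pseudo-random words from a Galois LFSR.

    `poly` holds the feedback polynomial x^8 + x^4 + x^3 + x^2 + 1 with the
    x^8 term implicit; the state must never be all-zero, so seed != 0.
    """
    assert seed != 0
    mask = (1 << width) - 1
    state = seed
    out = []
    for _ in range(n_vectors):
        out.append(state)
        msb = (state >> (width - 1)) & 1
        state = (state << 1) & mask
        if msb:
            state ^= poly   # apply the feedback taps
    return out
```

Since the assumed polynomial is primitive, the 200 generated vectors are all distinct (the full period would be 255 before the sequence repeats).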
4.5.13 The signature-maker block

The I/O signals are:

– bist-vector: input 16-bit bus;
– ck: input signal;
– reset: input signal;
– bist-strobe: output signal;
– signature: output 16-bit bus.
The signature maker block performs the signature analysis. In signature analysis, the test responses of a system are compacted into a signature using a linear feedback shift register (LFSR). Then the signature of the device under test is compared with the expected (reference) signature. If they match, the device is declared fault-free, otherwise it is declared faulty. Since several thousands of test responses are compacted into a few bits of signature by an LFSR, there is an information loss. As a result some faulty devices may have the same correct signature. The probability of a faulty device having the same signature as a working device is called the probability of aliasing, and is shown to be approximately 2^-m, where m denotes the number of bits in the signature.
The signature register implemented on CARLOS is 16 bits wide, so the probability of aliasing is 2^-16. The signature maker block takes the 16-bit bist-vector word coming from the outmux block and performs the signature analysis; then, when the FIFOs have been completely emptied and the signature value is ready, it asserts the bist-strobe signal.
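The compaction step can be sketched as follows: each 16-bit response is XOR-injected into the register, which is then stepped through its feedback polynomial. The polynomial below (the CRC-16-CCITT one) is an illustrative choice, not necessarily the one used on CARLOS.

```python
# Sketch of signature analysis with a 16-bit LFSR. The feedback polynomial
# is an illustrative assumption (x^16 + x^12 + x^5 + 1, i.e. CRC-16-CCITT),
# not the CARLOS signature-maker polynomial.

POLY16 = 0x1021

def signature(responses, seed=0x0000, width=16, poly=POLY16):
    """Compact a stream of 16-bit test responses into a 16-bit signature."""
    mask = (1 << width) - 1
    sig = seed
    for word in responses:
        sig ^= word & mask              # inject the response into the register
        msb = (sig >> (width - 1)) & 1
        sig = (sig << 1) & mask         # shift the register
        if msb:
            sig ^= poly                 # linear feedback
    return sig
```

The device is declared fault-free when the computed signature equals the reference one; since thousands of responses are folded into 16 bits, two different streams can collide with probability about 2^-16.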
Figure 4.13: Digital design flow for CARLOS v3
4.6 Digital design flow for CARLOS v3

Fig. 4.13 shows in some detail the digital design flow we used for the design of CARLOS v3 with the CERN 0.25 µm library. Since it is quite a recent library, we had to face some problems: for instance the small number of standard cells, the lack of 3-state buffers, the lack of worst-case cell models, the fact that only Verilog models for the cells (and not VHDL models) were provided, and so on. The reason for these shortcomings is that up to now very few chips have been realized and tested using this library, so not much characterization work could be done.
Moreover, we had to learn how to use the Cadence Verilog-XL software for post-synthesis simulations, since Synopsys allows simulating VHDL models only. Our main difficulty was due to the necessity of using VHDL-written testbenches for logic simulation and Verilog-written ones for netlist simulation: this can be very error-prone, since it is quite difficult to exactly match the two models together.
Besides that, we had to learn how to use Cadence Silicon Ensemble for the place and route job. This is a very difficult task when the standard cells are not completely characterized. We received great help from Marchioro's group, especially concerning the back-end design flow. They suggested that we follow a completely flat approach to the problem, since the chip is very small: the hierarchical approach, i.e. designing the layout of each block and then routing the blocks together, is only worthwhile when dealing with chip complexities one order of magnitude greater than ours.
4.7 CARLOS layout features

Fig. 4.14 shows a picture of the final layout of CARLOS v3, as it has been sent to the foundry. As one can easily observe it is pad-limited, i.e. the total silicon surface is determined by the number of I/O pads (100) and not by the number of standard cells it contains. Adding some extra logic would not imply any additional cost if contained in the area that is now empty. We therefore hope that adding the 2D compression logic will not substantially increase the chip area and, consequently, the production cost. The total area is 16 mm², corresponding to the minimal size the silicon wafer was divided into.
CARLOS v3 is a fairly simple chip compared to CARLOS v2 with its 300 kgates of logical complexity: in fact it contains only 10 kgates. Nevertheless it has been designed in order to test our approach to the new library and to verify that we were able to run through all the design flow steps. Our final check will be the test of the chip itself, in order to verify that everything was correctly designed, so as to have very clear ideas for the design of the final version of CARLOS.

Figure 4.14: CARLOS v3 layout picture
A specific PCB is in the design phase right now: it will contain only the chip itself and the connectors for probing with the Tektronix pattern generator and logic analyzer pods. Differently from CARLOS v2, the chip will be bonded into a PGA package and inserted on the PCB using a ZIF socket. This will allow us to test the 100 samples of the chip using only a few PCB samples.
Chapter 5

Wavelet based compression algorithm

As an alternative to the 1D and 2D compression algorithms conceived at the INFN Section of Torino, our group in Bologna decided to study other compression algorithms that may be used as a second-level compressor on SDD data. After studying the main standard compression algorithms, we decided to focus on a wavelet-based compression algorithm and its performance when used to compress SDD data.
The wavelet based compression algorithm design can be divided into four steps, requiring the use of different software tools:

1. choice of the algorithm main features;
2. optimization of the algorithm with respect to SDD data using the Matlab Wavelet Toolbox [23];
3. choice of the architecture for the implementation of the algorithm using Simulink [24];
4. comparison between the performance of the wavelet algorithm and that of the algorithms implemented on the CARLOS prototypes, in terms of compression ratio and reconstruction error.
5.1 Wavelet based compression algorithm

The idea of compressing SDD data using a multiresolution based compression algorithm comes from the growing success of this technique, both for uni-dimensional and bi-dimensional signal compression. Multiresolution analysis gives an equivalent representation of an input signal in terms of approximation and detail coefficients; these coefficients can then be encoded using standard techniques, such as run length encoding.
An SDD event, i.e. data coming from a half-SDD, can be analyzed as a uni-dimensional data stream of 64k samples or as a bi-dimensional structure of 256 by 256 elements. Thus the first choice to be taken is whether to implement a 1D or a 2D multiresolution analysis.
In 1D analysis the signal can be written as:

S = \big(\underbrace{s_1,\dots,s_{256}}_{\text{1st anode}},\ \underbrace{s_{257},\dots,s_{512}}_{\text{2nd anode}},\ \dots,\ \underbrace{s_{65281},\dots,s_{65536}}_{\text{256th anode}}\big)   (5.1)

In 2D analysis the signal can be written as:

S = \begin{pmatrix} s_{1,1} & s_{1,2} & \dots & s_{1,256} \\ s_{2,1} & s_{2,2} & \dots & s_{2,256} \\ \vdots & \vdots & \ddots & \vdots \\ s_{256,1} & s_{256,2} & \dots & s_{256,256} \end{pmatrix} \begin{matrix} \text{1st anode} \\ \text{2nd anode} \\ \vdots \\ \text{256th anode} \end{matrix}   (5.2)
In the case of 1D analysis, once the two decomposition filters H and G have been chosen, the multiresolution analysis can be applied with a number of levels, i.e. the number of cascaded filters, between 1 and 16. An orthogonal wavelet decomposition C with 64k coefficients is thus produced: the ratio of the number of approximation coefficients a_i to the number of detail coefficients d_i depends on the number of decomposition levels used:
C = \big(s_1,\dots,s_{65536}\big)                                          0 decomposition levels
C = \big(\underbrace{a_1,\dots,a_{32768}}_{\text{approx. coeffs.}},\underbrace{d_{32769},\dots,d_{65536}}_{\text{detail coeffs.}}\big)   1 decomposition level
C = \big(a_1,\dots,a_{16384},d_{16385},\dots,d_{65536}\big)                2 decomposition levels
C = \big(a_1,\dots,a_{8192},d_{8193},\dots,d_{65536}\big)                  3 decomposition levels
...
C = \big(a_1,a_2,a_3,a_4,d_5,\dots,d_{65536}\big)                          14 decomposition levels
C = \big(a_1,a_2,d_3,\dots,d_{65536}\big)                                  15 decomposition levels
C = \big(a_1,d_2,\dots,d_{65536}\big)                                      16 decomposition levels
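The halving of the approximation at every level can be made concrete with a small sketch of a Haar analysis filter bank (pairwise averages and differences, i.e. convolution with the Haar filters followed by decimation). This is illustrative code, not the thesis implementation.

```python
# Illustrative sketch of 1D multilevel Haar decomposition: each level halves
# the approximation, so L levels on a 65536-sample stream leave 65536/2**L
# approximation coefficients. Not the thesis code.
import math

def haar_step(signal):
    """One Haar level: pairwise sums (low-pass H) and differences (high-pass G),
    scaled by 1/sqrt(2), i.e. filtering followed by decimation by 2."""
    half = len(signal) // 2
    a = [(signal[2*i] + signal[2*i + 1]) / math.sqrt(2) for i in range(half)]
    d = [(signal[2*i] - signal[2*i + 1]) / math.sqrt(2) for i in range(half)]
    return a, d

def haar_decompose(signal, levels):
    """Return (approximation, [detail lists, coarsest last]) after `levels` levels."""
    a, details = list(signal), []
    for _ in range(levels):
        a, d = haar_step(a)
        details.append(d)
    return a, details
```

Since the step is orthonormal, the total energy of the coefficients equals that of the input; and decomposing a 64k-sample stream on 3 levels leaves 65536/2^3 = 8192 approximation coefficients, matching the listing above.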
In the case of 2D analysis, once the two decomposition filters H and G have been chosen, the bi-dimensional decomposition scheme is applied with a number of levels between 1 and 8. First, multiresolution analysis is applied to each row of the 2D signal, then each column resulting from the previous analysis is decomposed using the same number of levels.
The 2D signal (5.2) is thus transformed into the 2D orthogonal wavelet decomposition, containing 64k coefficients; in this case too, the ratio of the number of approximation coefficients to the number of detail coefficients depends on the number of decomposition levels applied:
S = \begin{pmatrix} s_{1,1} & \dots & s_{1,256} \\ \vdots & \ddots & \vdots \\ s_{256,1} & \dots & s_{256,256} \end{pmatrix}   0 decomposition levels

C = \begin{pmatrix} a_{1,1} & \dots & a_{1,128} & d_{1,129} & \dots & d_{1,256} \\ \vdots & & \vdots & \vdots & & \vdots \\ a_{128,1} & \dots & a_{128,128} & d_{128,129} & \dots & d_{128,256} \\ d_{129,1} & \dots & d_{129,128} & d_{129,129} & \dots & d_{129,256} \\ \vdots & & \vdots & \vdots & & \vdots \\ d_{256,1} & \dots & d_{256,128} & d_{256,129} & \dots & d_{256,256} \end{pmatrix}   1 decomposition level

...

C = \begin{pmatrix} a_{1,1} & a_{1,2} & d_{1,3} & \dots & d_{1,256} \\ a_{2,1} & a_{2,2} & d_{2,3} & \dots & d_{2,256} \\ d_{3,1} & d_{3,2} & d_{3,3} & \dots & d_{3,256} \\ \vdots & \vdots & \vdots & & \vdots \\ d_{256,1} & d_{256,2} & d_{256,3} & \dots & d_{256,256} \end{pmatrix}   7 decomposition levels

C = \begin{pmatrix} a_{1,1} & d_{1,2} & \dots & d_{1,256} \\ d_{2,1} & d_{2,2} & \dots & d_{2,256} \\ \vdots & \vdots & \ddots & \vdots \\ d_{256,1} & d_{256,2} & \dots & d_{256,256} \end{pmatrix}   8 decomposition levels
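One level of the rows-then-columns scheme described above can be sketched as follows (illustrative code with a self-contained Haar step; a 2N by 2N frame yields an N by N approximation block in the top-left corner plus three N by N detail blocks).

```python
# Sketch of one 2D decomposition level: 1D Haar analysis applied to every
# row, then to every column of the result. Illustrative only.
import math

def haar_step(v):
    """One 1D Haar level: scaled pairwise sums (approx.) and differences (detail)."""
    s2 = math.sqrt(2)
    half = len(v) // 2
    return ([(v[2*i] + v[2*i + 1]) / s2 for i in range(half)],
            [(v[2*i] - v[2*i + 1]) / s2 for i in range(half)])

def haar2d_level(image):
    """One 2D level; the approximation block ends up in the top-left corner,
    detail blocks elsewhere, as in the 1-level matrix shown above."""
    # rows first: each row becomes [approximation | detail]
    rows = [a + d for a, d in (haar_step(r) for r in image)]
    # then columns of the intermediate result
    cols = list(zip(*rows))
    out_cols = [a + d for a, d in (haar_step(list(c)) for c in cols)]
    return [list(r) for r in zip(*out_cols)]
```

For a constant 4 by 4 frame of ones, the output is a 2 by 2 approximation block of value 2 in the top-left corner with all detail coefficients equal to 0.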
Applying multiresolution analysis to SDD data proves to be useful, since the approximation coefficients feature high values, as they represent the signal approximation, while the detail coefficients feature values near 0. Thus, in order to get compression, detail coefficients can be eliminated without losing significant information on the input signal.
An easy and effective technique for compressing data after multiresolution analysis is to apply a threshold to every coefficient a_i and d_i. What we expect is that the approximation coefficients a_i remain unchanged, while the detail coefficients d_i are all set to 0. This is useful since the long zero sequences coming from the detail coefficients can be further compressed using the run length encoding technique.
The multiresolution based compression algorithm described so far is a lossy technique, but it can be used in a lossless way by not applying the threshold to the wavelet coefficients.
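The thresholding and run length encoding steps can be sketched as follows; the threshold value and the encoding format are illustrative assumptions, not the scheme implemented on CARLOS.

```python
# Sketch of thresholding followed by run length encoding of zero runs.
# Threshold value and ("Z", run_length) token format are assumptions.

def apply_threshold(coeffs, th):
    """Set to 0 every coefficient whose absolute value is below th."""
    return [c if abs(c) >= th else 0 for c in coeffs]

def run_length_encode(coeffs):
    """Encode runs of zeros as ('Z', run_length); keep other values as-is."""
    out, zeros = [], 0
    for c in coeffs:
        if c == 0:
            zeros += 1
        else:
            if zeros:
                out.append(("Z", zeros))
                zeros = 0
            out.append(c)
    if zeros:
        out.append(("Z", zeros))
    return out

# Two large approximation-like coefficients followed by small details:
coeffs = [120, 95, 2, -1, 0, 3, -2, 0, 0, 1]
packed = run_length_encode(apply_threshold(coeffs, 5))
# packed == [120, 95, ("Z", 8)]
```

The long zero runs produced by the thresholded detail coefficients collapse into single tokens, which is where the compression gain comes from.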
5.1.1 Configuration parameters of the multiresolution algorithm

Some algorithm parameters can be tuned in order to get the best performance in terms of compression ratio and reconstruction error. These parameters are:

– the pair of decomposition filters H and G, used to implement the multiresolution analysis;
– the number of dimensions used for the analysis: 1D or 2D;
– the number of decomposition levels;
– the threshold value applied to the a_i and d_i coefficients.
5.2 Multiresolution algorithm optimization

The multiresolution algorithm optimization has been carried out using the Wavelet Toolbox from Matlab. First, the pair of decomposition filters that, for a fixed value of the threshold, gives the highest number of null coefficients a_i and d_i and the lowest reconstruction error has been chosen; then the other three parameters have been evaluated one after the other for optimization.
5.2.1 The Wavelet Toolbox from Matlab

The Wavelet Toolbox is a collection of Matlab functions that, through Matlab line commands and a user-friendly graphical interface, allows wavelet techniques to be developed and applied to real problems. In particular the Wavelet Toolbox allowed us to:

– perform the multiresolution analysis of a signal and the corresponding synthesis, using a wide variety of decomposition and reconstruction filters;
– treat signals as uni-dimensional or bi-dimensional;
– analyze signals on a variable number of levels;
– apply different threshold levels to the obtained coefficients a_i and d_i.

The wide choice of filters corresponds to the number of wavelet families implemented by the Wavelet Toolbox, shown in Tab. 5.1 and in Fig. 2.10, Fig. 2.11 and Fig. 2.12.
Family                          Name identifier
Haar wavelet                    'haar'
Daubechies wavelets             'db'
Symlets                         'sym'
Coiflets                        'coif'
Biorthogonal wavelets           'bior'
Reverse Biorthogonal wavelets   'rbio'

Table 5.1: Wavelet families used for multiresolution analysis

In particular the Haar family is composed of the wavelet function ψ(x) and its corresponding scale function φ(x), already discussed in Chapter 2. On the other hand, each of the Daubechies, Symlets and Coiflets families is composed of more than one pair of functions ψ(x) and φ(x): Daubechies family pairs are named db1, ..., db10, Symlets family pairs are named sym2, ..., sym8, while Coiflets family pairs are named coif1, ..., coif5.
The Biorthogonal (bior1.1, ..., bior6.8) and Reverse Biorthogonal (rbio1.1, ..., rbio6.8) families are composed of quartets of functions ψ1(x), φ1(x), ψ2(x) and φ2(x), where the first pair is used for decomposition and the second for reconstruction. Using a particular function of the Wavelet Toolbox, which requires the name of the chosen pair of functions ψ(x) and φ(x), or the name of the quartet ψ1(x), φ1(x), ψ2(x) and φ2(x) when using Biorthogonal and Reverse Biorthogonal wavelets, it is possible to determine the impulse responses representing, respectively, the low-pass filter H and the high-pass filter G used for decomposition, and the corresponding low-pass and high-pass filters used in the reconstruction stage.
Multiresolution analysis and synthesis are computed as described in Chapter 3: in particular the analysis step is performed with a convolution operation between the input signal and the filters H and G, followed by decimation, while synthesis is performed with up-sampling, followed by a convolution operation between the signal and the reconstruction filters.
5.2.2 Choice of the filters

In order to choose the best decomposition and reconstruction filters for SDD data compression, 10 64-kbyte SDD events have been analyzed with the Wavelet Toolbox, using the wavelet families shown in Tab. 5.1. Each signal S, interpreted both as uni-dimensional as in Fig. 5.1 and as bi-dimensional as in Fig. 5.2, has been processed in the following way:

– after choosing a pair of functions ψ(x) and φ(x), or the quartet ψ1(x), φ1(x), ψ2(x), φ2(x), the corresponding filter coefficients have been determined;
– the signal S has been analyzed using the filters H and G, obtaining the decomposition coefficients C;
– a threshold th has been applied to the coefficients C, obtaining the modified coefficients Cth;
Figure 5.1: Uni-dimensional analysis on 5 levels of the signal S (decomposition at level 5: s = a5 + d5 + d4 + d3 + d2 + d1)
– the coefficients Cth have been synthesized into the signal R, using the reconstruction filters.

Both in the uni-dimensional and in the bi-dimensional case, the compression performance has been quantified using the percentage P of null coefficients in Cth, while the reconstruction error has been quantified using the root mean square error E between the original signal S and the signal R obtained after the analysis and synthesis of Cth.
In particular, since the total number of elements in Cth is 65536, in the uni-dimensional case the parameter P can be expressed in the following way:

P = \frac{100 \cdot (\text{number of null coefficients in } C_{th})}{65536}   (5.3)
The total number of elements in S and in R is also 65536, so, if s_i is the i-th element of the uni-dimensional signal S and r_i is the i-th element of R, the parameter E can be expressed in the following way:

E = \sqrt{\frac{1}{65536} \sum_{i=1}^{65536} (s_i - r_i)^2}   (5.4)

Figure 5.2: Bi-dimensional analysis on 5 levels of the signal S

In the bi-dimensional case P is calculated in the same way while, naming s_{i,j} the (i,j)-th element of S and r_{i,j} the (i,j)-th element of R, the parameter E can be expressed in the following way:
E = \sqrt{\frac{1}{65536} \sum_{i=1}^{256} \sum_{j=1}^{256} (s_{i,j} - r_{i,j})^2}   (5.5)

Even if the parameters P and E are not directly comparable to the results obtained with the compression algorithms implemented on the CARLOS prototypes, they give an important indication of the performance of each filter set used during the analysis.
In particular, P gives a rough estimate of how much the coefficients Cth can be compressed using run length encoding, while E can
be interpreted as the error introduced in the value associated to each sample coming from the SDD. The analysis results related to 10 SDD events are shown in Tab. 5.2 to Tab. 5.7. In particular, Tab. 5.2 shows the values of the parameters P and E for a 5-level analysis using the Haar filter, both in 1D and 2D, with a threshold value th varying in the range 0-25. The other tables show the P and E values obtained with a 5-level analysis and a threshold th of 25, using filters belonging to the Daubechies (Tab. 5.3), Symlets (Tab. 5.4), Coiflets (Tab. 5.5), Biorthogonal (Tab. 5.6) and Reverse Biorthogonal (Tab. 5.7) families, in the 1D and 2D cases. The uncertainties ∆P and ∆E are reported in terms of the respective orders of magnitude only, since we are only looking for an estimate of these values.
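The two figures of merit defined in (5.3) to (5.5) can be computed directly; the following is a minimal sketch (for 2D data, the samples are simply flattened into one list before computing E).

```python
# Sketch of the figures of merit of Eqs. (5.3)-(5.5): P is the percentage of
# null coefficients after thresholding, E the RMS reconstruction error.
import math

def percentage_null(cth):
    """Eq. (5.3): 100 * (number of null coefficients in Cth) / total."""
    return 100.0 * sum(1 for c in cth if c == 0) / len(cth)

def rms_error(s, r):
    """Eqs. (5.4)/(5.5): RMS difference between the original signal s and the
    reconstruction r (for 2D data, pass the 65536 samples flattened)."""
    assert len(s) == len(r)
    return math.sqrt(sum((si - ri) ** 2 for si, ri in zip(s, r)) / len(s))
```

A perfect reconstruction gives E = 0, and a coefficient vector with half of its entries zeroed gives P = 50; on the real SDD events, the residual E of about 1e-14 at th = 0 reflects only the finite machine precision.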
An interesting feature emerging from Tab. 5.2 is the progressive increase of the values P and E as the threshold th applied to the coefficients C increases.
The trend of P is easy to understand considering that applying the threshold th to the decomposition coefficients C means setting to 0 all coefficients smaller than th in absolute value: thus the greater the th value, the greater the parameter P.
As for E, the greater the th value, the greater the differences between Cth and the original C, and hence the distortion introduced.
It is to be noticed that for a value of th equal to 0, the parameter P is 9.12 while the parameter E is 1.26e-14, i.e. the percentage of null coefficients in Cth and the reconstruction error are both very small. This is quite easy to understand for P since, without a threshold, the only null coefficients are a very small fraction of the total number. As for E, leaving the coefficients C unmodified ensures a nearly perfect reconstruction of the signal; the value 1.26e-14 comes from the finite precision of the machine performing the analysis and synthesis processes.
Haar

Threshold th     P (1D)    E (1D)     P (2D)    E (2D)
0                 9.12     1.26e-14    3.68     2.50e-14
1                24.68     0.27       22.21     0.28
2                40.01     0.63       42.63     0.75
3                58.60     1.64       56.34     1.19
4                67.08     1.71       67.76     1.67
5                75.56     2.09       75.50     2.09
6                79.87     2.38       80.77     2.44
7                83.56     2.68       84.96     2.77
8                86.71     2.99       88.21     3.08
9                88.82     3.23       90.75     3.36
10               90.70     3.48       92.88     3.63
11               92.21     3.72       94.49     3.87
12               93.20     3.89       95.80     4.08
13               94.16     4.07       96.78     4.26
14               94.81     4.21       97.56     4.42
15               95.33     4.34       98.20     4.57
16               95.72     4.44       98.73     4.71
17               96.03     4.54       99.05     4.80
18               96.20     4.60       99.25     4.86
19               96.41     4.67       99.44     4.93
20               96.54     4.72       99.55     4.97
21               96.62     4.76       99.64     5.01
22               96.69     4.79       99.69     5.03
23               96.73     4.81       99.74     5.05
24               96.76     4.83       99.77     5.07
25               96.79     4.85       99.80     5.09

Table 5.2: Mean values of P and E on 10 SDD events (∆P ≈ ∆E ≈ 0.01): the analysis has been performed on 5 levels, using the set of filters derived from the Haar wavelet.
Daubechies

Filters     P (1D)    E (1D)    P (2D)    E (2D)
db1         96.79     4.85      99.80     5.09
db2         96.75     4.82      99.63     5.08
db3         96.73     4.81      99.54     5.07
db4         96.73     4.81      99.48     5.07
db5         96.72     4.81      99.33     5.07
db6         96.71     4.81      99.27     5.07
db7         96.72     4.82      99.20     5.07
db8         96.70     4.81      99.08     5.08
db9         96.69     4.81      98.98     5.09
db10        96.68     4.80      98.98     5.09

Table 5.3: Mean values of P and E on 10 SDD events (∆P ≈ ∆E ≈ 0.01): the analysis has been performed on 5 levels, using the set of Daubechies filters and a threshold level th equal to 25; the values obtained with db1 are equivalent to the ones obtained with Haar, since the corresponding filters are equivalent.
Symlets

Filters     P (1D)    E (1D)    P (2D)    E (2D)
sym2        96.75     4.82      99.63     5.08
sym3        96.73     4.81      99.54     5.07
sym4        96.74     4.82      99.43     5.07
sym5        96.72     4.81      99.38     5.06
sym6        96.73     4.81      99.33     5.07
sym7        96.70     4.80      99.17     5.06
sym8        96.71     4.80      99.11     5.08

Table 5.4: Mean values of P and E on 10 SDD events (∆P ≈ ∆E ≈ 0.01): the analysis has been performed on 5 levels, using the set of Symlets filters and a threshold value th equal to 25.
5.2 — Multiresolution algorithm optimization<br />
Coiflets<br />
1D 2D<br />
Filters P E P E<br />
coif1 96.74 4.82 99.51 5.07<br />
coif2 96.72 4.80 98.32 4.75<br />
coif3 96.72 4.81 99.60 5.06<br />
coif4 96.69 4.80 98.62 5.06<br />
coif5 96.68 4.80 98.29 5.05<br />
Table 5.5: Mean values of P and E on 10 SDD events (∆P ≈ ∆E ≈ 0.01):<br />
the analysis has been performed on a 5-level base, using the Coiflets set of<br />
filters and a threshold value th equal to 25.<br />
Biorthogonal<br />
1D 2D<br />
Filters P E P E<br />
bior1.1 96.79 4.85 99.80 5.09<br />
bior1.3 96.68 4.81 99.48 5.07<br />
bior1.5 96.64 4.82 99.25 5.05<br />
bior2.2 96.28 4.71 98.70 4.94<br />
bior2.4 96.28 4.65 98.56 4.92<br />
bior2.6 96.23 4.62 98.27 4.91<br />
bior2.8 96.21 4.63 97.81 4.91<br />
bior3.1 93.41 5.68 94.15 5.58<br />
bior3.3 94.37 4.84 95.43 5.01<br />
bior3.5 94.70 4.65 96.60 5.10<br />
bior3.7 94.81 4.59 95.13 4.85<br />
bior3.9 94.88 4.56 94.13 4.85<br />
bior4.4 96.75 4.82 99.39 5.07<br />
bior5.5 96.78 4.88 99.46 5.10<br />
bior6.8 96.68 4.79 98.95 5.04<br />
Table 5.6: Mean values of P and E on 10 SDD events (∆P ≈ ∆E ≈ 0.01):<br />
the analysis has been performed on a 5-level base, using the Biorthogonal<br />
set of filters and a threshold value th equal to 25.<br />
Reverse Biorthogonal<br />
1D 2D<br />
Filters P E P E<br />
rbio1.1 96.79 4.85 99.80 5.09<br />
rbio1.3 96.77 4.85 99.57 5.08<br />
rbio1.5 96.75 4.86 99.39 5.06<br />
rbio2.2 96.78 4.92 96.89 4.58<br />
rbio2.4 96.79 4.88 99.47 5.12<br />
rbio2.6 96.77 4.87 99.32 5.11<br />
rbio2.8 96.78 4.88 99.18 5.12<br />
rbio3.1 96.38 8.67 98.76 11.29<br />
rbio3.3 96.72 5.14 99.29 5.39<br />
rbio3.5 96.76 4.95 99.28 5.18<br />
rbio3.7 96.76 4.92 99.09 5.18<br />
rbio3.9 96.74 4.91 98.97 5.20<br />
rbio4.4 96.68 4.80 99.29 5.06<br />
rbio5.5 93.32 4.63 98.56 4.92<br />
rbio6.8 96.71 4.81 99.10 5.08<br />
Table 5.7: Mean values of P and E on 10 SDD events (∆P ≈ ∆E ≈<br />
0.01): the analysis has been performed on a 5-level base, using the Reverse<br />
Biorthogonal set of filters and a threshold value th equal to 25; the values<br />
obtained with rbio1.1 are equivalent to the ones obtained with Haar, since<br />
the corresponding filters are equivalent.<br />
The common feature emerging from Tab. 5.3, Tab. 5.4, Tab. 5.5, Tab. 5.6 and<br />
Tab. 5.7 is that the values of P and E increase as the threshold value th increases.<br />
Nevertheless some wavelet families are better suited than others to the<br />
compression task: by comparing the values obtained for th = 25, it is<br />
evident that the Haar set of filters shows the best performance.<br />
In particular, with P = 96.79 and E = 4.85 in the one-dimensional case<br />
and P = 99.80 and E = 5.09 in the two-dimensional case, the Haar set<br />
of filters achieves the highest percentage of null coefficients with an<br />
acceptable error.<br />
Family Set of filters name Filter length<br />
Haar haar 2<br />
Daubechies dbN 2N<br />
Symlets symN 2N<br />
Coiflets coifN 6N<br />
Biorthogonal bior1.1 2<br />
biorN1.N2 (N1.N2 ≠ 1.1) max(2N1,2N2)+2<br />
Reverse Biorthogonal rbio1.1 2<br />
rbioN1.N2 (N1.N2 ≠ 1.1) max(2N1,2N2)+2<br />
Table 5.8: Length of the filters belonging to the different families<br />
The choice of the Haar filters is supported by a further argument<br />
concerning the length of the filters H, G, H̃ and G̃, i.e. the number of<br />
coefficients which characterize their impulse responses.<br />
As shown in Tab. 5.8, the filters belonging to the Haar family have the<br />
smallest number of coefficients, obviously together with the equivalent<br />
sets of filters db1, bior1.1 and rbio1.1. Since the analysis and synthesis<br />
processes consist of successive convolutions between the signal<br />
to be analyzed or synthesized and the respective filters, this small number<br />
of coefficients allows for a higher execution speed of the analysis and<br />
synthesis processes.<br />
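The two-tap structure of the Haar filters can be made concrete with a short C sketch (our own illustration, not the thesis code): one level of analysis and synthesis, showing both the small per-coefficient cost and the perfect reconstruction property.

```c
#include <assert.h>
#include <math.h>

/* One level of Haar analysis and synthesis on a signal of even length n.
   H = {1/sqrt(2), 1/sqrt(2)} (low pass) and G = {1/sqrt(2), -1/sqrt(2)}
   (high pass) have only 2 taps each, so a whole level costs one add, one
   subtract and one scaling per pair of samples. */
static const double S2 = 0.70710678118654752440; /* 1/sqrt(2) */

void haar_analysis(const double *x, int n, double *approx, double *detail)
{
    for (int i = 0; i < n / 2; i++) {
        approx[i] = S2 * (x[2 * i] + x[2 * i + 1]); /* low pass + decimate */
        detail[i] = S2 * (x[2 * i] - x[2 * i + 1]); /* high pass + decimate */
    }
}

void haar_synthesis(const double *approx, const double *detail, int n, double *x)
{
    for (int i = 0; i < n / 2; i++) {
        x[2 * i]     = S2 * (approx[i] + detail[i]);
        x[2 * i + 1] = S2 * (approx[i] - detail[i]);
    }
}
```

With no threshold applied, synthesis reproduces the input up to machine precision, which is exactly the behaviour discussed in Par. 5.3.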
5.2.3 Choice of the dimensionality, number of levels<br />
and threshold value<br />
Having chosen the Haar set of filters, we studied the effect on the parameters<br />
P and E of the dimensionality (1D or 2D), of the number of levels used for<br />
the decomposition (1, 2, ..., 16 in 1D and 1, 2, ..., 8 in 2D) and of the value<br />
of the threshold th.<br />
Tab. 5.9 and Tab. 5.10 show the analysis <strong>of</strong> the usual 10 SDD events in<br />
1D and 2D; each table contains the values of P and E for 1, 3 and<br />
5 levels of decomposition and, for each number of levels, for threshold<br />
values between 0 and 25.<br />
The first result is that the two-dimensional analysis produces a higher<br />
percentage P of null coefficients than the one-dimensional one; nevertheless<br />
its E values are also higher.<br />
For instance, comparing the P and E values for a threshold value th<br />
of 25, the 1D analysis on 1 level determines P = 50.01 and E = 1.85,<br />
while the 2D analysis determines P = 74.96 and E = 3.96; the same 1D<br />
analysis on 3 levels determines P = 87.45 and E = 4.18, versus the<br />
values P = 98.35 and E = 5.00 in the 2D case.<br />
Another result obtained from the tables is that, once it has been decided<br />
whether to use the 1D or the 2D analysis, an increase in the number of<br />
decomposition levels determines an increase in the values of the parameters<br />
P and E.<br />
For instance, by comparing the values in Tab. 5.9 obtained with th equal to<br />
25, it can be noticed that the 1D analysis on 1 level determines P = 50.01<br />
and E = 1.85, on 3 levels P = 87.45 and E = 4.18, while on 5 levels<br />
P = 96.79 and E = 4.85. The same holds true for the 2D analysis<br />
and synthesis. Therefore the optimized version of a multiresolution<br />
analysis based algorithm for SDD data is a 2D analysis on the maximum<br />
number of decomposition levels using the Haar set of filters.<br />
As far as the threshold th is concerned, the parameters P and E increase<br />
when th is increased. In order to decide on the th value we have to be able<br />
to quantify the reconstruction error introduced by the wavelet analysis<br />
and to compare it with that of the compression algorithms implemented on<br />
CARLOS.<br />
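The thresholding step from which P is computed can be sketched in a few lines of C (our own illustration, not the thesis code; whether the comparison is |c| ≤ th or |c| < th is an assumption of ours):

```c
#include <assert.h>
#include <math.h>

/* Apply a hard threshold th to a vector of decomposition coefficients and
   return the percentage P of null coefficients left afterwards. */
double threshold_and_count(double *c, int n, double th)
{
    int nulls = 0;
    for (int i = 0; i < n; i++) {
        if (fabs(c[i]) <= th) /* assumption: hard threshold on |c[i]| */
            c[i] = 0.0;
        if (c[i] == 0.0)
            nulls++;
    }
    return 100.0 * nulls / n;
}
```

The reconstruction error E then follows from running the synthesis on the thresholded coefficients and comparing with the original signal.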
5.3 Choice <strong>of</strong> the architecture<br />
The precision of the architecture chosen for the implementation<br />
of the multiresolution analysis can strongly affect the percentage P of<br />
null coefficients and the reconstruction error E. As an example, it is<br />
sufficient to apply both the analysis and the synthesis processes to an input<br />
signal without any threshold: the reconstruction error E, though very<br />
small, differs from 0, because of the finite precision with which our Pentium<br />
II processor performs the calculations.<br />
In order to quantify the influence of the architecture on the algorithm<br />
performance we used Simulink, a software tool from Matlab for the<br />
design and simulation of complex systems, together with the Fixed-Point<br />
Blockset [25], which allows one to simulate the performance of a given<br />
algorithm when implemented on different architectures, both in fixed and<br />
in floating point.<br />
5.3.1 Simulink and the Fixed-Point Blockset<br />
The Fixed-Point Blockset tool [25] is one of the Simulink libraries; it<br />
contains blocks performing operations between signals, such as sum,<br />
multiplication and convolution, while simulating various types of<br />
architectures, both fixed and floating point. This tool is very useful since<br />
it allows the designer to study the performance of a given algorithm on<br />
different architectures before the actual implementation takes place.<br />
For instance, it can be used to decide whether a Fourier transform can be<br />
implemented with acceptable performance on a fixed-point DSP (Digital<br />
Signal Processor) or whether it has to be implemented on a floating-point<br />
DSP. The difference is relevant especially for cost reasons, since a<br />
floating-point DSP is much more expensive than a fixed-point one. We<br />
used the Fixed-Point Blockset with the same purpose of finding the most<br />
suitable architecture before the actual implementation.<br />
Among the various floating and fixed-point architectures handled by<br />
the Fixed-Point Blockset, we studied the following ones:<br />
– double precision floating point IEEE 754 standard architecture;<br />
– single precision floating point IEEE 754 standard architecture;<br />
– fractional fixed point.<br />
The IEEE 754 standard architecture is one of the most widespread<br />
and is used in most floating-point processors.<br />
When double precision is used, the standard requires a 64-bit word in<br />
which 1 bit (b63) holds the sign s, 11 bits (b62 − b52) the exponent e<br />
and the remaining 52 bits (b51 − b0) the mantissa m. The relationship<br />
between the binary and the decimal representation is the following one:<br />
decimal value = (−1)^s · 2^(e−1023) · (1.m), 0 &lt; e &lt; 2047<br />
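The three fields s, e and m can be unpacked from the 64-bit word directly; the following C sketch (our own, assuming the platform allows reinterpreting a double as a 64-bit integer, and restricted to normalized numbers) rebuilds the decimal value from them:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>
#include <math.h>

/* Unpack sign s (bit 63), exponent e (bits 62-52) and mantissa m (bits 51-0)
   of an IEEE 754 double and rebuild the value as (-1)^s * 2^(e-1023) * (1.m).
   Valid for normalized numbers only (0 < e < 2047). */
double ieee754_rebuild(double v)
{
    uint64_t bits;
    memcpy(&bits, &v, sizeof bits);          /* reinterpret the 64-bit word */
    uint64_t s = bits >> 63;
    uint64_t e = (bits >> 52) & 0x7FF;       /* 11 exponent bits */
    uint64_t m = bits & 0xFFFFFFFFFFFFFULL;  /* 52 mantissa bits */
    double frac = 1.0 + (double)m / 4503599627370496.0; /* 1.m, 2^52 */
    return (s ? -1.0 : 1.0) * ldexp(frac, (int)e - 1023);
}
```

For example, −6.25 is stored as s = 1, e = 1025, 1.m = 1.5625, and the function reconstructs it exactly.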
In the fractional fixed point architecture the s bits on the right of the<br />
radix point (b0 − bs−1) contain the fractional part of the number, one<br />
bit (bs) contains the sign of the number and the remaining guard<br />
bits (bs+1 − b31) on the left of the radix point contain the integer part<br />
of the number.<br />
It is to be noticed that the double precision floating point IEEE 754 standard<br />
architecture features a precision of 2^−52 ≈ 10^−16, the single precision<br />
IEEE 754 one a precision of 2^−23 ≈ 10^−7, while the fractional fixed point<br />
architecture has a precision of 2^−s, i.e. its precision depends on the<br />
number of bits used for the fractional part of the number. Therefore<br />
the study of the influence of the fractional fixed point architecture on the<br />
multiresolution analysis has been carried out by varying the position of<br />
the radix point within the 32-bit word.<br />
5.3.2 Choice of the architecture<br />
Implementing the two-dimensional multiresolution analysis and synthesis<br />
in Simulink is quite a long job, both in terms of design and of simulation<br />
time. Therefore we decided to implement the one-dimensional algorithm on<br />
16 decomposition levels, since it is a much quicker and simpler job.<br />
Besides, it gives a rather good estimate of the performance of the 3<br />
architectures on an algorithm very similar to the one we have chosen.<br />
The implementation in Simulink of the multiresolution analysis<br />
and synthesis processes is shown in the external blocks of Fig. 5.3: the<br />
block on the left performs the 1D analysis of the signal S using the<br />
Haar set of filters, while the block on the right applies a threshold to<br />
the decomposition coefficients and performs the synthesis of the signal R.<br />
Figure 5.3: Developed Simulink blocks: from left to right the analysis block<br />
(16 levels of analysis), the delay block and the threshold and synthesis block<br />
(threshold application and 16 levels of synthesis)<br />
Figure 5.4: Zoom on the developed analysis block<br />
Figure 5.5: Zoom on the developed threshold and synthesis block<br />
Figure 5.6: Zoom on the developed synthesis block<br />
The analysis block has been implemented as a 16-level cascade, see<br />
Fig. 5.4, containing high-pass filter operators (Hi Dec Filter), low pass<br />
filter operators (Low Dec Filter) and Downsample operators. Hi Dec<br />
Filter operators perform convolution between the incoming signal and<br />
the Haar high pass decomposition filter, Low Dec Filter operators perform<br />
convolution between the incoming signal and the Haar low pass<br />
decomposition filter, while the Downsample operators perform the decimation<br />
<strong>of</strong> the incoming signal.<br />
Fig. 5.5 shows the threshold and synthesis block, which is subdivided<br />
into 3 major sub-blocks: the sub-block on the left applies a threshold<br />
to the input stream, the sub-block on the right performs the synthesis<br />
of the signal, while the central block, called To Workspace, stores the<br />
decomposition coefficients after the application of the threshold, so that<br />
they can be used for calculating the percentage P of null coefficients.<br />
The synthesis block has been implemented, in analogy to the analysis<br />
block, as a 16-level cascade, see Fig. 5.6, containing Hi Rec Filter operators<br />
performing the convolution between the incoming signal and the<br />
Haar high-pass reconstruction filter, Low Rec Filter operators performing<br />
the convolution between the incoming signal and the Haar low-pass<br />
reconstruction filter, FixPt Sum operators performing the sum between<br />
filtered signals and Upsample operators performing the upsampling on<br />
the incoming signals.<br />
Finally, the Delay block shown in Fig. 5.3 has the task of starting the<br />
synthesis process only when the analysis job has been completed. It is<br />
to be noticed that the analysis, delay and synthesis blocks have been<br />
developed starting from simple blocks belonging to the Fixed-Point<br />
Blockset, such as filtering, downsampling and upsampling blocks.<br />
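The chained Hi_Dec/Low_Dec + Downsample structure of Fig. 5.4 can also be expressed in a few lines of C (our own illustration, not the Simulink model; the buffer length and the number of levels are arbitrary here):

```c
#include <assert.h>
#include <math.h>

#define NSAMP 64 /* power of two; a real SDD stream is much longer */
static const double S2 = 0.70710678118654752440; /* 1/sqrt(2) */

/* Multi-level Haar cascade: at every level the approximation is filtered
   and downsampled again, like the chained blocks of Fig. 5.4. The details
   of level l end up in d[l][...]; the final approximation stays in a[0..].
   The update is in place: position i is only written after positions 2i
   and 2i+1 have been read. */
int haar_cascade(double *a, int n, int levels, double d[][NSAMP / 2])
{
    int done = 0;
    for (int l = 0; l < levels && n >= 2; l++, done++) {
        for (int i = 0; i < n / 2; i++) {
            d[l][i] = S2 * (a[2 * i] - a[2 * i + 1]); /* high pass + decimate */
            a[i]    = S2 * (a[2 * i] + a[2 * i + 1]); /* low pass + decimate */
        }
        n /= 2;
    }
    return done; /* number of levels actually performed */
}
```

On a constant signal all detail coefficients vanish, which is the limiting case of the sparsity that the thresholding step exploits.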
After performing the analysis and the synthesis of the 10 SDD events with<br />
a threshold value equal to 25 for the 3 architectures described above,<br />
we obtained the values shown in Tab. 5.11; as a notation, the double<br />
precision floating point IEEE 754 standard architecture is indicated as<br />
ieee754doub, the single precision one as ieee754sing and the fractional<br />
fixed point architecture as fixed(s), where s is the number of bits<br />
representing the fractional part of the number.<br />
The Simulink simulations show how the values of P and E depend on the<br />
precision of the selected architecture. Taking as a reference the values<br />
of P and E least influenced by the finite precision of the calculations,<br />
i.e. those related to the ieee754doub architecture, a slight increase<br />
in the error E can be noticed in the cases ieee754sing, fixed(18),<br />
fixed(15), fixed(12) and fixed(9), while P remains constant; in the cases<br />
fixed(7), fixed(5) and fixed(3) the discrepancy with the ieee754doub<br />
values increases strongly.<br />
The results we obtained therefore pointed us towards the choice of<br />
one of the following architectures: ieee754doub, ieee754sing, fixed(18),<br />
fixed(15), fixed(12) and fixed(9). Our choice fell on ieee754sing, as<br />
explained in Par. 5.5.<br />
5.4 Multiresolution algorithm performances<br />
For a direct comparison between the performances obtained by the<br />
compression algorithms implemented on the CARLOS prototypes and by the<br />
multiresolution based algorithm, we developed a FORTRAN subroutine<br />
running the analysis and the synthesis on a floating-point single precision<br />
SPARC5 processor. The FORTRAN subroutine can be logically divided<br />
into two parts: the first aims at estimating the compression performance<br />
of the algorithm, the second at estimating the reconstruction error on<br />
the cluster charge.<br />
The first part of the subroutine performs the analysis, the application<br />
of the threshold th and the synthesis on SDD events containing several<br />
charge clusters. After applying the analysis and the threshold, for each<br />
SDD event the reciprocal of the compression ratio is calculated as<br />
c−1 = (number of output bits) / (number of input bits), with<br />
the assumption that each non-null decomposition coefficient is encoded<br />
using two 32-bit words, one representing the value <strong>of</strong> the coefficient<br />
itself, the other representing the number <strong>of</strong> null coefficients between<br />
the current and the previous non-null coefficient. Therefore the number of<br />
bits entering the algorithm is the number of samples multiplied by 8<br />
bits (64k × 8 = 512k), while the number of bits exiting the algorithm<br />
is the number of non-null decomposition coefficients multiplied by the<br />
32 + 32 = 64 bits used to encode each coefficient.<br />
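The output-bit count assumed above can be sketched as follows (our own illustration; coefficients are taken as 32-bit integers):

```c
#include <assert.h>
#include <stdint.h>

/* Count the output bits of the (value, zero-run) encoding assumed in the
   text: each non-null decomposition coefficient costs two 32-bit words,
   one for its value and one for the run of nulls preceding it. */
long encode_bits(const int32_t *coef, long n)
{
    long nonnull = 0;
    for (long i = 0; i < n; i++)
        if (coef[i] != 0)
            nonnull++;
    return nonnull * 64; /* 32 + 32 bits per non-null coefficient */
}
```

With 64k input samples of 8 bits each, c−1 is then encode_bits(...) divided by 524288.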
The second part of the FORTRAN subroutine applies the analysis, the<br />
threshold and the synthesis to single-cluster SDD events.<br />
After the analysis, the application of the threshold th and the synthesis,<br />
the difference between the coordinates of the cluster charge before<br />
compression and after synthesis is computed for each SDD event, as well<br />
as the percentage difference between the charge of the cluster before<br />
compression and after reconstruction.<br />
Fig. 5.7, Fig. 5.8, Fig. 5.9, Fig. 5.10, Fig. 5.11 and Fig. 5.12 show the<br />
values of the compression parameter c−1 for different threshold values th;<br />
in each figure the upper histogram represents the c−1 values for the 500<br />
SDD events analyzed, while the lower histogram represents the c−1 values<br />
related to the SDD events whose c−1 value is less than 46 × 10−3 (c = 22).<br />
As shown in the histograms, the mean c−1 values are lower than our target<br />
value c−1 = 46 × 10−3 for each threshold value selected. Therefore the<br />
multiresolution algorithm can reach an acceptable compression ratio<br />
already with a threshold of 20 on the analysis coefficients.<br />
As far as the reconstruction error calculation is concerned, up to now we<br />
could use only 20 single-cluster events, so the histograms reporting the<br />
coordinate and charge differences before and after compression suffer<br />
from very poor statistics.<br />
For this reason the results we obtained on the reconstruction error are<br />
rather qualitative up to now: in particular, performing the analysis on<br />
20 SDD events with a threshold level th equal to 21, the differences<br />
in the centroid coordinates before and after compression are of<br />
the order of magnitude of the µm, whereas the cluster charge shows an<br />
underestimation of a few percentage points.<br />
These qualitative results are of the same order of magnitude as those of<br />
the compression algorithms implemented in the CARLOS prototypes.<br />
Figure 5.7: c−1 values for th = 20<br />
5.5 Hardware <strong>implementation</strong><br />
The <strong>hardware</strong> we have chosen for the <strong>implementation</strong> <strong>of</strong> the wavelet<br />
based <strong>compression</strong> algorithm is a DSP chip from Analog Devices (AD):<br />
the ADSP-21160. The DSP belongs to the Single Instruction Multiple<br />
Data SHARC family produced by AD. It performs calculations both<br />
in fixed-point and in single precision floating point at the same speed.<br />
Our choice fell on this DSP also for this interesting feature, since it<br />
allows us to try two different architectures with a single chip. The chip<br />
has the following features:<br />
– 600 MFLOPS (32-bit floating point) peak operation;<br />
Figure 5.8: c−1 values for th = 21<br />
Figure 5.9: c−1 values for th = 22<br />
Figure 5.10: c−1 values for th = 23<br />
Figure 5.11: c−1 values for th = 24<br />
Figure 5.12: c−1 values for th = 25<br />
– 600 MOPS (32-bit fixed point) peak operation;<br />
– 100 MHz core operation;<br />
– 4 Mbits on-chip dual-ported SRAM;<br />
– division <strong>of</strong> SRAM between program and <strong>data</strong> memory is user selectable;<br />
– 14 channels <strong>of</strong> zero overhead DMA;<br />
– JTAG standard test access port.<br />
A particularly interesting feature of this chip is the amount of on-chip<br />
memory: 4 Mbits are sufficient to store the algorithm program and at least<br />
2 SDD events (each one requires 512 Kbits). Therefore, while one SDD<br />
event is being processed, another one can be fetched into the internal SRAM<br />
using the DMA channels, thus increasing the total throughput.<br />
The DSP has been bought together with an evaluation board and the<br />
VisualDSP integrated development environment, which allows one to write<br />
C code and download it to the DSP chip. The implementation of the wavelet<br />
based compression algorithm on the DSP is still in the design phase, so<br />
no data concerning the algorithm speed are available yet for a quantitative<br />
comparison with the CARLOS chip prototypes.<br />
Haar<br />
1D<br />
1 level 3 levels 5 levels<br />
Threshold value th P E P E P E<br />
0 7.78 3.02 e-15 9.05 7.11 e-15 9.12 1.26 e-14<br />
1 17.51 0.22 23.67 0.26 24.68 0.27<br />
2 31.23 0.65 38.11 0.62 40.01 0.63<br />
3 40.09 1.01 55.81 1.21 58.60 1.64<br />
4 44.28 1.25 63.48 1.56 67.08 1.71<br />
5 47.84 1.52 71.20 2.00 75.56 2.09<br />
6 48.78 1.61 74.80 2.26 79.87 2.38<br />
7 49.31 1.68 77.81 2.52 83.56 2.68<br />
8 49.71 1.74 80.38 2.79 86.71 2.99<br />
9 49.78 1.76 82.02 2.99 88.82 3.23<br />
10 49.87 1.78 83.41 3.19 90.70 3.48<br />
11 49.91 1.79 84.50 3.38 92.21 3.72<br />
12 49.94 1.80 85.17 3.50 93.20 3.89<br />
13 49.97 1.81 85.81 3.64 94.16 4.07<br />
14 49.98 1.82 86.25 3.75 94.81 4.21<br />
15 49.98 1.83 86.60 3.84 95.33 4.34<br />
16 49.99 1.83 86.85 3.92 95.72 4.44<br />
17 50.00 1.84 87.02 3.98 96.03 4.54<br />
18 50.00 1.84 87.12 4.02 96.20 4.60<br />
19 50.00 1.84 87.24 4.07 96.41 4.67<br />
20 50.00 1.84 87.32 4.10 96.54 4.72<br />
21 50.01 1.84 87.36 4.12 96.62 4.76<br />
22 50.01 1.84 87.40 4.14 96.69 4.79<br />
23 50.01 1.84 87.42 4.16 96.73 4.81<br />
24 50.01 1.85 87.43 4.17 96.76 4.83<br />
25 50.01 1.85 87.45 4.18 96.79 4.85<br />
Table 5.9: Mean values of P and E over 10 SDD events (∆P ≈ ∆E ≈ 0.01): the analysis has been performed with 1, 3 and 5 decomposition levels, using the 1D Haar filter set.
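The 1, 3 and 5 level decompositions of Table 5.9 are obtained by re-applying the single-level transform to the running average part of the previous pass. A minimal C sketch of such a multi-level 1D Haar decomposition follows; the divide-by-two scaling and the buffer size are illustrative assumptions, not the exact setup of the simulations.

```c
#include <stddef.h>

#define MAX_N 256 /* scratch size; input length must not exceed this */

/* Multi-level 1D Haar decomposition: each pass transforms the running
 * average part in place, so levels = 1, 3 or 5 reproduces the setups
 * of Table 5.9. Averages stay in data[0..len/2), details follow. */
void haar1d_multi(float *data, size_t n, int levels)
{
    float tmp[MAX_N];
    size_t len = n;
    for (int l = 0; l < levels && len >= 2; l++) {
        size_t half = len / 2;
        for (size_t i = 0; i < half; i++) {
            tmp[i]        = (data[2 * i] + data[2 * i + 1]) / 2.0f;
            tmp[half + i] = (data[2 * i] - data[2 * i + 1]) / 2.0f;
        }
        for (size_t i = 0; i < len; i++)
            data[i] = tmp[i];
        len = half;
    }
}
```

Thresholding the resulting coefficient array with the values th = 0 … 25 then yields figures of merit analogous to the P and E columns above.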
Haar 2D
                  1 level           3 levels          5 levels
Threshold th      P      E          P      E          P      E
 0                3.54   5.32e-15   3.67   1.5e-14    3.68   2.50e-14
 1               18.90   0.26      22.06   0.28      22.21   0.28
 2               36.05   0.69      42.33   0.74      42.63   0.75
 3               46.42   1.07      55.90   1.19      56.34   1.19
 4               55.25   1.47      67.15   1.66      67.76   1.67
 5               60.69   1.80      74.78   2.07      75.50   2.09
 6               64.01   2.06      79.95   2.42      80.77   2.44
 7               66.46   2.30      84.03   2.75      84.96   2.77
 8               68.30   2.51      87.18   3.05      88.21   3.08
 9               69.73   2.70      89.64   3.33      90.75   3.36
10               70.95   2.90      91.72   3.59      92.88   3.63
11               71.87   3.06      93.25   3.82      94.49   3.87
12               72.63   3.22      94.51   4.03      95.80   4.08
13               73.20   3.35      95.46   4.21      96.78   4.26
14               73.65   3.47      96.21   4.36      97.56   4.42
15               74.06   3.59      96.84   4.51      98.20   4.57
16               74.38   3.69      97.34   4.64      98.73   4.71
17               74.53   3.75      97.63   4.72      99.05   4.80
18               74.65   3.80      97.82   4.79      99.25   4.86
19               74.76   3.85      98.01   4.85      99.44   4.93
20               74.82   3.87      98.11   4.89      99.55   4.97
21               74.87   3.90      98.20   4.93      99.64   5.01
22               74.91   3.92      98.25   4.95      99.69   5.03
23               74.93   3.94      98.29   4.97      99.74   5.05
24               74.94   3.95      98.32   4.99      99.77   5.07
25               74.96   3.96      98.35   5.00      99.80   5.09
Table 5.10: Mean values of P and E over 10 SDD events (∆P ≈ ∆E ≈ 0.01): the analysis has been performed with 1, 3 and 5 decomposition levels, using the 2D Haar filter set.
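The 2D transform behind Table 5.10 is separable: one level consists of the 1D Haar transform applied first to every row and then to every column of the anode-time map. A C sketch of one such level is shown below; the 4×4 block size, the function name and the 1/2 scaling are illustrative assumptions (the real SDD event maps are much larger).

```c
/* One level of the separable 2D Haar transform on a small W x H block:
 * rows first, then columns. After the call, the running average sits in
 * the top-left W/2 x H/2 quadrant and the details in the other three. */
enum { W = 4, H = 4 };

void haar2d_level(float img[H][W])
{
    float tmp[(W > H) ? W : H];
    for (int r = 0; r < H; r++) {               /* transform each row */
        for (int c = 0; c < W / 2; c++) {
            tmp[c]         = (img[r][2 * c] + img[r][2 * c + 1]) / 2.0f;
            tmp[W / 2 + c] = (img[r][2 * c] - img[r][2 * c + 1]) / 2.0f;
        }
        for (int c = 0; c < W; c++)
            img[r][c] = tmp[c];
    }
    for (int c = 0; c < W; c++) {               /* then each column */
        for (int r = 0; r < H / 2; r++) {
            tmp[r]         = (img[2 * r][c] + img[2 * r + 1][c]) / 2.0f;
            tmp[H / 2 + r] = (img[2 * r][c] - img[2 * r + 1][c]) / 2.0f;
        }
        for (int r = 0; r < H; r++)
            img[r][c] = tmp[r];
    }
}
```

Because the baseline of each anode is nearly flat, most detail coefficients of a real event are small, which is why P grows so quickly with the threshold in the table.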
Architecture    Precision   P       E
ieee754doub     2^-52       99.88   5.07
ieee754sing     2^-23       99.88   5.11
fixed(18)       2^-18       99.88   5.11
fixed(15)       2^-15       99.88   5.11
fixed(12)       2^-12       99.88   5.11
fixed(9)        2^-9        99.88   5.11
fixed(7)        2^-7        99.87   6.04
fixed(5)        2^-5        99.81   12.75
fixed(3)        2^-3        99.52   89.09
Table 5.11: Mean values of P and E over 10 SDD events (∆P ≈ ∆E ≈ 0.01), obtained with Simulink simulations.
Conclusions
The main goal of this thesis work was the search for compression algorithms, and their hardware implementation, to be applied to the data coming from the Silicon Drift Detectors in the ALICE experiment.
ALICE and, in general, the LHC experiments put very stringent constraints on the compression algorithms as regards compression ratio, reconstruction error, speed and flexibility. For example, the data produced by the SDDs have to be reduced by a factor of 22 in order to satisfy the constraints on disk space for permanent storage. Many standard compression algorithms have therefore been studied in order to find which one obtains the best trade-off between compression ratio and reconstruction error, i.e. the distortion introduced. It is rather obvious, in fact, that a compression ratio as high as 22 can only be achieved at the expense of some loss of information on the physical charge distribution over the SDD surface.
Three hardware prototypes implementing data compression are presented in this thesis: the front-end chips CARLOS v1, v2 and v3. Their evolution from version 1 to version 3 reflects the architectural changes in the readout chain that occurred during the three years of this work. Three major reasons justify these changes:
– the necessity to work in a radiation environment, forcing us to choose a radiation-tolerant technology;
– the lack of space for the SIU board, forcing us to change the readout architecture;
– the change from a one-dimensional (1D) compression algorithm to a two-dimensional (2D) one, in order to obtain the same compression ratio as in 1D while using lower thresholds, thus losing a smaller amount of physical data.
CARLOS v4 is planned to be the final version of the chip: it will contain the 2D algorithm and will be designed to be compliant with the new readout architecture. It should be sent to the foundry before the end of 2002.
One of the main features of these chips is that lossy compression can be switched off when needed in favour of lossless compression. Lossless data compression becomes necessary when the compression algorithms implemented on the CARLOS chips are no longer applicable: for example, the 2D compression algorithm does not perform well in the presence of a slope in the baseline of the anode signals. In this case on-line compression on the front-end has to be switched off and a second-level compressor in the counting room has to do the job. For this kind of application, different compression algorithms have to be studied.
As an alternative to the 1D and 2D algorithms, our group in Bologna decided to study a wavelet based compression algorithm, in order to assess whether it could be useful for a possible second-level data compression. Our simulations showed that the algorithm performs well in terms of both compression ratio and reconstruction error. We are still working to obtain more quantitative results and, at the same time, a DSP implementation is planned for the near future in order to evaluate the compression speed and how many DSPs would be necessary for the task. The use of DSPs in the counting room may be very convenient since, unlike ASICs, they are completely reprogrammable in software: as many different compression algorithms as desired can thus be tried on the input data in order to find the best one.
Bibliography
[1] ALICE Collaboration, “Technical Proposal for A Large Ion Collider Experiment at the CERN LHC”, December 1995, CERN/LHCC/95-71.
[2] The LHC Study Group, “The Large Hadron Collider Conceptual Design”, October 1995, CERN/AC/95-05(LHC).
[3] P. Giubellino, E. Crescio, “The ALICE experiment at LHC: physics prospects and detector design”, January 2001, ALICE-PUB-2000-35.
[4] CERN/LHCC 99-12, ALICE TDR 4, 18 June 1999.
[5] E. Crescio, D. Nouais, P. Cerello, “A detailed study of charge diffusion and its effect on spatial resolution in Silicon Drift Detectors”, September 2001, ALICE-INT-2001-09.
[6] F. Faccio, K. Kloukinas, G. Magazzu, A. Marchioro, “SEU effects in registers and in a Dual-Ported Static RAM designed in a 0.25 µm CMOS technology for applications in the LHC”, Fifth Workshop on Electronics for LHC Experiments, September 20-24, 1999, pages 571-575.
[7] K. Sayood, “Introduction to Data Compression”, Morgan Kaufmann, San Francisco, 1996.
[8] E. S. Ventsel, “Teoria delle probabilità”, Mir edition.
[9] S. W. Smith, “The Scientist and Engineer’s Guide to Digital Signal Processing”, California Technical Publishing, San Diego, 1999.
[10] J. Badier, Ph. Busson, A. Karar, D. W. Kim, G. B. Kim, S. C. Lee, “Reduction of ECAL data volume using lossless data compression techniques”, Nuclear Instruments and Methods in Physics Research A 463 (2001), pages 361-374.
[11] R. Polikar, “The Engineer’s Ultimate Guide to Wavelet Analysis”, http://engineering.rowan.edu/~polikar/WAVELETS/WTtutorial.html, 2001.
[12] P. G. Lemarié, Y. Meyer, “Ondelettes et bases hilbertiennes”, Revista Matemática Iberoamericana, Vol. 2, pages 1-18, 1986.
[13] E. J. Stollnitz, T. D. DeRose and D. H. Salesin, “Wavelets for computer graphics: a primer”, IEEE Computer Graphics and Applications, Vol. 15, No. 3, pages 76-84, May 1995 (part 1) and Vol. 15, No. 4, pages 75-85, July 1995 (part 2).
[14] P. Morton, “Image Compression Using the Haar Wavelet Transform”, http://online.redwoods.cc.ca.us/instruct/darnold/maw/haar.htm, 1998.
[15] B. Burke Hubbard, “The World According to Wavelets: the story of a mathematical technique in the making”, A K Peters, Ltd., Wellesley, 1998.
[16] S. G. Mallat, “A Theory for Multiresolution Signal Decomposition: The Wavelet Representation”, IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 11, No. 7, pages 674-693, July 1989.
[17] D. Cavagnino, P. De Remigis, P. Giubellino, G. Mazza and A. E. Werbrouck, “Data Compression for the ALICE Silicon Drift Detector”, 1998, ALICE-INT-1998-41.
[18] P. Gupta and N. McKeown, “Designing and Implementing a Fast Crossbar Scheduler”, IEEE Micro, Jan/Feb 1999.
[19] D. Cavagnino, P. Giubellino, P. De Remigis, A. Werbrouck, G. Alberici, G. Mazza, A. Rivetti, F. Tosello, “Zero suppression and Data Compression for SDD Output in the ALICE Experiment”, Internal Note/SDD, ALICE-INT-1999-28 v1.0.
[20] P. Moreira, J. Christiansen, A. Marchioro, E. van der Bij, K. Kloukinas, M. Campbell, G. Cervelli, “A 1.25 Gbit/s Serializer for LHC Data and Trigger Optical Links”, Fifth Workshop on Electronics for LHC Experiments, September 20-24, 1999, pages 194-198.
[21] F. Wang, “BIST using pseudorandom test vectors and signature analysis”, IEEE 1988 Custom Integrated Circuits Conference, CH2584-1/88/0000-0095.
[22] T. W. Williams, W. Daehn, “Aliasing errors in multiple input signature analysis registers”, 1989 IEEE, CH2696-3/89/0000/0338.
[23] M. Misiti, Y. Misiti, G. Oppenheim and J. M. Poggi, “Wavelet Toolbox User’s Guide”, The MathWorks, Inc., Natick, 2000.
[24] “Simulink User’s Guide: Dynamic System Simulation for Matlab”, The MathWorks, Inc., Natick, 2000.
[25] “Fixed-Point Blockset User’s Guide: for Use with Simulink”, The MathWorks, Inc., Natick, 2000.