
Athens University of Economics and Business
Department of Computer Science

Networking Issues in wireless sensor networks and the Data Aggregation problem: A survey

Dimosthenis Pediaditakis
Athens, Monday 3 July 2006

1st Supervisor: George C. Polyzos
2nd Supervisor: George Xylomenos

A thesis submitted in partial fulfillment of the requirements of the degree of Master in Computer Science of Athens University of Economics and Business, Department of Computer Science

Networking Issues in wireless sensor networks and the Data Aggregation problem: A survey

Abstract – In recent years, a great deal of research has been devoted to wireless sensor networks (WSNs). WSNs are networks of small self-powered devices that communicate with each other over the wireless medium and coordinate their operation in order to perform distributed sensing near physical phenomena. The total lifetime of the network is a major design factor for the protocols, applications and, in general, the algorithms used by a WSN. It is therefore desirable to avoid transmitting large amounts of data. Each sensor node produces raw data locally upon sensing events. These data usually flow to a central node (called the sink) that collects them for further processing. Such a traffic pattern tends to exhaust the limited resources of the sensor network, including its energy. In-network processing, and data aggregation techniques in particular, try to reduce the traffic inside the network by trading communication, which is expensive in terms of energy, for local computation at each node. In this paper, we first point out the main networking challenges for a WSN and give an overview of the data aggregation problem. We then present a taxonomy of the most representative aggregation techniques. Finally, we discuss the main trade-offs that result from the use of data aggregation and conclude our work.

Keywords – Wireless sensor networks, data aggregation, in-network processing, data-centric routing, sensor database systems, query processing.

1 Introduction

Recently, technological advancements in hardware have led to the design of extremely powerful chips. Gordon E. Moore observed in 1965 that the complexity of integrated circuits, with respect to minimum component cost, doubles roughly every year, a rate he revised in 1975 to a doubling every 24 months. This “law” has held for several decades. However, on April 13, 2005, Gordon Moore himself stated in an interview that the law may not remain valid for much longer, since transistors may reach the limits of miniaturization at atomic levels. With computational speed approaching its physical limits, hardware designers have turned their attention to the miniaturization of devices. This trend towards compact systems has enabled the development of coin-sized computing devices that are capable of producing digital representations of real-world phenomena. These devices are widely known as wireless sensors and they usually consist of sensing, data processing and communicating components. The ability to communicate with each other through wireless interfaces enables the deployment of a large number of sensor nodes in a field, forming a wireless sensor network (WSN). In order for sensors to be both small and inexpensive, they have several resource constraints:

Low bandwidth communication – The bandwidth of wireless links is usually limited to a few hundred Kbps. The network does not provide quality of service, the latency is highly variable and packet loss is frequent.

Power consumption – Most of the time sensors are battery powered. The batteries used are non-rechargeable, irreplaceable and very small. Therefore, while traditional networks aim to achieve high quality of service, sensor network protocols focus primarily on power conservation. Prolonging the network’s lifetime is the main goal. In addition, a WSN must support parameterized trade-off mechanisms that give the end user the option of prolonging network lifetime at the cost of lower throughput or higher transmission delay.

Computation – The node has limited computational power, but it is usually adequate to cover the needs of the communication, application and sensing activities.

Sensing accuracy – Signal processing functions convert physical events into internal data representations. Sensed data, due to limitations of the sensor, may contain

environmental noise, which introduces uncertainty into the readings. A damaged sensor might also generate inaccurate data.

Figure 1 illustrates a typical sensor node (the Berkeley MICA Mote) so that the reader can appreciate its tiny size. Table 1 lists the hardware characteristics of the MICA Mote.

Figure 1. The Berkeley MICA Mote

Processor        4 MHz, 8-bit MCU (ATMEL)
Storage          512 KB
Radio Band       916 MHz
Comm. Range      33 m
Data Rate        40 Kbits/sec
Tx Power         12 mA
Receive Power    1.8 mA

Table 1. Mica Mote’s characteristics

Wireless sensor networks can be used in a wide range of application domains:
• Environmental monitoring (coordinated sensing or information gathering in a disaster area).
• Supervising items in a factory warehouse.
• Intelligent building management.
• Organization of vehicle traffic in a large city (traffic routing).
• Military (target recognition and tracking).
• Medical and health care.

Each one of the above applications has its own network, hardware and node deployment requirements. WSNs lack standardized solutions because they are application-specific networks, i.e., the design requirements of a sensor network change depending on the application. However, whatever the application, the communication scenario is always the same (Figure 2). A user sends several tasks to the network in the form of queries. The user can either be directly attached to a gateway node or be connected remotely through an intermediate network (e.g. the Internet or a satellite link). After the query reaches the WSN (sink), it is automatically translated into an internal

Thus, in-network processing is one of the most effective ways to minimize the traffic inside a WSN and it is generally performed in the form of data aggregation. A typical query asks for the average/max/min of the sensed values within a given area. Assume that the user is placed at the sink and injects into the network a query Q, asking for the average temperature of area A. The query can be answered either by using a direct-delivery approach or by performing data aggregation:

Direct Delivery
After the propagation of Q to area A, each node inside A is required to send its own readings back to the host node (sink) for processing. After receiving all data packets from the source nodes, the sink locally aggregates all of the data into a final value and reports the value back to the user.

Distributed in-network aggregation
In this technique, the sensor network forms a reverse multicast tree, as shown in Figure 3, where the sink injects query Q into the network. The sensor nodes start sending back their sensed values related to the phenomena of interest. All the information flows in a top-down manner and, as the packets arrive from multiple sensor nodes, they are aggregated inside the network before they reach the sink. For example, the data of sensors A and B are aggregated at node E. In the same way, sensor node F aggregates the data from sensor nodes C and D. When a node receives two or more values from its children, it forwards down only one value (the aggregate), thereby reducing the total number of transmitted messages. In addition, the final value that reaches the sink is the requested aggregate.
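The difference between the two approaches can be made concrete with a minimal sketch. It assumes the tree of Figure 3 (A and B report to E, C and D report to F, and E and F report to the sink), uses hypothetical temperature readings, and counts one message per hop. Because an average cannot be merged directly, each node forwards a partial (sum, count) pair; this representation is an assumption of the sketch and is not taken from the text.

```python
# Hypothetical readings (degrees Celsius) at the four source nodes.
readings = {"A": 20.0, "B": 22.0, "C": 19.0, "D": 23.0}

# Assumed aggregation tree, child -> parent.
parent = {"A": "E", "B": "E", "C": "F", "D": "F", "E": "sink", "F": "sink"}


def hops_to_sink(node):
    """Number of links between a node and the sink."""
    hops = 0
    while node != "sink":
        node = parent[node]
        hops += 1
    return hops


def direct_delivery():
    """Every raw reading is relayed hop by hop; the sink computes the average."""
    messages = sum(hops_to_sink(node) for node in readings)
    average = sum(readings.values()) / len(readings)
    return average, messages


def in_network_aggregation():
    """Each node merges its children's partial (sum, count) and forwards one message."""
    partial = {node: (value, 1) for node, value in readings.items()}
    messages = 0
    # Visit the deepest nodes first so children are merged before their parent forwards.
    for node in sorted(parent, key=hops_to_sink, reverse=True):
        node_sum, node_count = partial[node]
        parent_sum, parent_count = partial.get(parent[node], (0.0, 0))
        partial[parent[node]] = (parent_sum + node_sum, parent_count + node_count)
        messages += 1  # exactly one transmission per node
    total, count = partial["sink"]
    return total / count, messages


if __name__ == "__main__":
    print("direct delivery       :", direct_delivery())
    print("in-network aggregation:", in_network_aggregation())
```

Both functions return the same average, but the aggregating version sends one message per node instead of one message per reading per hop, and the saving grows with the depth of the tree.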

Figure 3. Aggregation tree is a reverse multicast tree

The plan for the remainder of this paper is as follows. In Section 2, we present the main networking issues of wireless sensor networks, as well as data-centric routing. In Section 3 we discuss the data aggregation problem and distinguish the different types of aggregates that may be computed over a WSN. In Section 4 we present a taxonomy of the widely accepted current aggregation techniques. In Section 5 we review the main trade-offs resulting from the use of data aggregation. Section 6 concludes the paper.

The following figure (Figure 4) visualizes the three different ranges mentioned above:

Figure 4. TR: transmission range, DR: detection range, IR: interference range

The transmission range (and of course both the detection and interference ranges) can be increased by allowing the transceiver to consume more energy. It has been found that the power required by a sender in order to reach a receiver at a given distance d is proportional to d^2 (when the antennas of the communicating nodes are low and close to the ground, the exponent increases, up to at most 4). Many hardware manufacturers support dynamic changes of the transmitting power, so the transmission range can be adapted to the current needs of the network. This is not the case, though, for many WSNs, due to cost-per-node constraints.

With the previous discussion of the behavior of the wireless medium in mind, we now turn to the factors that mainly affect the signal propagation process:

Shadowing: The environment in which a sensor node is placed is likely to be dynamic, so changes may occur unexpectedly. Often, an object may obstruct the communication between two or more sensor nodes. A nice example would be a big animal sleeping in the middle of the network.

Path loss: An electromagnetic wave attenuates as the distance between the two communicating nodes increases. This is the reason why we need to transmit at higher power levels in order to reach a distant node.

Fading: The characteristics of the channel change over time and location, leading to variations in the power of the received signal.

Interference: In a network it is almost certain that all sources transmit, and all receivers listen, on the same frequency. If there is no mechanism that can

synchronize all the nodes perfectly, two or more signals will inevitably reach the same receiver at different power levels. This is called interference.

Multipath propagation: The same signal may follow two or more different paths on its way to the receiver because of reflections (on objects or even on the ground). The multiple paths have various lengths; different path lengths lead to different reception times, and this finally results in the blurring of the signal at the receiver.

Another function that the physical network layer may support is physical carrier sense. This is done by listening to the channel for possible transmissions in order to avoid sending signals that may interfere with others. Carrier sense is performed at random times, unless the MAC sublayer explicitly requests it.

To give an idea of the specifications of the wireless channel of a typical WSN node, we present the following table (Table 2), which lists the transceiver characteristics of the most popular commercial sensor nodes (for more information the reader is encouraged to read [14]).

Prototype                               Radio                      Frequency band               Standard        Spread spectrum   Data rate
UC Berkeley / Crossbow WeC (1999)       TR1000                     868 / 916 MHz                -               No                10 kbps
UC Berkeley / Crossbow Rene (2000)      TR1000                     868 / 916 MHz                -               No                10 kbps
UC Berkeley / Crossbow MICA (2002)      TR1000                     868 / 916 MHz                -               No                40 kbps
UC Berkeley / Crossbow MICA2 (2003)     Chipcon CC1000             315, 433 or 868 / 916 MHz    -               Yes (s/w)         38.4 kbps
Intel iMote (2003)                      Wireless BT Zeevo          2.4 GHz                      IEEE 802.15.1   Yes               600 kbps
Microstrain, Galbreath et al. (2003)    RF Monolithics DR-3000-1   916.5 MHz                    -               -                 75 kbps

Table 2. Transceiver characteristics of the most popular commercial sensor nodes
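The path-loss relation given earlier in this section (required power proportional to the distance raised to an exponent of 2, and at most 4 for low antennas close to the ground) can be illustrated with a small sketch. The function name, the reference distance and the example distances are hypothetical and serve only to show how quickly the required power grows.

```python
def relative_tx_power(distance, reference_distance=1.0, exponent=2.0):
    """Power needed to reach `distance`, relative to the power needed at
    `reference_distance`, assuming required power grows as distance ** exponent."""
    if distance <= 0 or reference_distance <= 0:
        raise ValueError("distances must be positive")
    if not 2.0 <= exponent <= 4.0:
        raise ValueError("the text bounds the exponent between 2 and 4")
    return (distance / reference_distance) ** exponent


if __name__ == "__main__":
    # Doubling the distance costs 4x the power with exponent 2 and 16x with exponent 4.
    for exponent in (2.0, 3.0, 4.0):
        print(exponent, relative_tx_power(2.0, exponent=exponent))
```

Under these assumptions, even a modest increase in range is expensive, which is one reason adjustable transmission power matters where the hardware supports it.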

2.1.3 Proposed schemes

Several systems have suggested physical layer solutions that provide APIs to the upper layers, such as setting the radio into different states (sleep, idle, reception, transmission). The lack of standardization at the lower layers and in the (physical) sensor hardware is noticeable. In the following paragraphs we mention the most significant and popular schemes that have been proposed.

Bluetooth [13]
Operates in the 2.4 GHz ISM band. It employs binary Gaussian Frequency Shift Keying (GFSK) at a symbol rate of 1 MBaud. It performs frequency hopping spread spectrum, randomly hopping across 79 channels (1 MHz each) at 1600 hops per second. Although it is frequently suggested for sensor applications, the Bluetooth physical layer is not very suitable for WSNs [12]. This is mainly because it consumes a lot of power searching the band for the network on a packet-rate time scale (frequency hopping). What is more, the narrow channel separation makes the phase noise requirements of signal sources harder to meet.

IEEE 802.11b WLAN [15]
Operates in the 2.4 GHz ISM band. This worldwide IEEE standard specifies three different layer-1 options at 1 Mb/s and optionally 2 Mb/s:
• Infrared,
• Frequency hopping spread spectrum at 2.4 GHz and
• Direct sequence spread spectrum at 2.4 GHz [DBPSK at 1 Mb/s and DQPSK at 2 Mb/s. Using complementary code keying (CCK) it achieves up to 11 Mb/s].
The 1- and 2-Mb/s direct sequence 802.11 physical layer is a possible option for a WSN because it has minimal hardware requirements, the provided data rate is more than enough and it does not have the problems of frequency hopping systems (like Bluetooth). Using the extended version (CCK) of the direct sequence option has a high cost in power consumption and transceiver hardware complexity (e.g. the transceiver has to support complex identification and decoding functions for the CCK sequences).

μAMPS [12]

Proposes a bottom-up approach (from the physical layer to the application layer) to the development of a protocol for a wireless sensor network. The application, MAC and physical layers have to be tightly integrated with the hardware of the sensor node. For example, the energy penalties of switching states, or the power consumed in each state of a transceiver, might determine the policy of the MAC layer scheme or even affect the network layer routing algorithm. Multiple access schemes (Time Division Multiple Access, Frequency Division Multiple Access) are compared, and the same is done for modulation schemes (binary, M-ary modulation). It is found that a hybrid TDMA-FDMA scheme and the M-ary modulation scheme achieve significant power savings. Another energy-expensive factor is the frequent turning on and off of a node's transceiver, because the control input spends a non-negligible amount of time setting the right voltage on the transceiver (start-up energy). All of the above shows that layer 1 must not be treated as a black box when designing an upper-level protocol or even an application for a WSN.

2.2 Data Link Layer (MAC)

2.2.1 Medium Access Control (MAC) functionality

Above the Physical Layer (Layer-1) resides the Data Link Layer (Layer-2). Using only the services provided by layer 1, the link layer is responsible for the transmission of data packets between two communicating nodes. The communication pattern may be point-to-point or even point-to-multipoint. Nevertheless, this function is not as simple as it sounds. There are several services that the link layer may offer to the upper layers, such as:
• medium access,
• packet encapsulation in link-layer frames and data frame detection,
• flow control,
• error detection/correction,
• data stream multiplexing and
• reliable communication mode.

Medium access functionality is so important that it constitutes an individual sub-layer, the Medium Access Control sub-layer, widely known as MAC. To illustrate the importance of the MAC sub-layer, its primary functions are listed below:
• medium access control before transmitting,
• bit-stream fragmentation into frames upon reception,
• level-2 frame encapsulation before transmission,
• insertion of checksums for error detection,
• insertion of the source and destination MAC addresses inside all transmitted frames and
• frame filtering by checking the destination MAC address of the packet.
For the rest of this section, we will be concerned with the medium access problem.

2.2.2 MAC issues for wireless communications

Before analyzing the MAC sub-layer issues from the perspective of WSNs, it is useful to review the main problems that a general-purpose wireless network faces. The wireless (radio, infrared, optical) medium has to be tightly controlled because of its broadcast nature. When a host transmits, the nodes inside its transmission range will hear it, the nodes inside its detection range will just detect its signal and, finally, the nodes within its interference range will receive a vague and confusing signal. When two or more nodes within the same “neighborhood” transmit packets at the same time, the nearby nodes that listen to these transmissions will get a confusing mixed signal and will eventually discard the received packets. This phenomenon is called a collision and results in the loss of all the transmitted packets. When a collision occurs, the senders have to retransmit their data.

A mechanism that plays the role of the traffic controller is imperative. This mechanism must give everyone the opportunity to talk, prevent the monopolization of the conversation by a single participant and guarantee that no one can interrupt the

other. There are three general families of protocols that are able to control access to the wireless medium. The protocols of each category are presented briefly:

• Channel Partitioning Protocols (Contention free)
Predetermined assignments are used to avoid contention. These protocols are more applicable to static networks and/or networks with centralized control.

o Time Division Multiple Access (TDMA) – All users share the same frequency by dividing it into timeslots. Each user waits for its turn (the time that corresponds to its timeslot) and uses all or part of the provided bandwidth.
Advantages:
a) It naturally avoids collisions, so there are no extra overheads.
b) It has a short duty cycle.
c) Fairness.
Disadvantages:
a) It needs synchronization mechanisms.
b) It requires the formation of communication clusters.
c) It does not scale well, because timeslots are difficult to assign dynamically.
d) The transmission rate is limited, because even if a node is the only one that has something to send it has to wait for its turn.

o Frequency Division Multiple Access (FDMA) – The channel is divided into equal smaller frequency bands that are called subdivisions. Each subdivision is assigned to a node, so every participant can use only a fraction (of equal size for all) of the total channel bandwidth. FDMA has the same advantages as TDMA. The drawbacks are also the same, with one difference: a node that is the only one with something to transmit does not have to wait for its turn, but it can still use only a fraction of the channel bandwidth.

o Hybrid TDMA/FDMA – This approach combines the two previous protocols and the medium is divided on a time-frequency basis. It is a very common solution for wireless communications, but it has inefficiencies similar to those of TDMA and FDMA.

o Code Division Multiple Access (CDMA) – The channel is divided neither by time nor by frequency; instead, each node is assigned a code (chipping sequence) with which it

encodes the data before transmission. The codes are “mutually orthogonal”, which means that nodes can transmit simultaneously without the possibility of collisions. Of course, the selection and distribution of the codes among the participants is not an easy task, especially when their total number is high. The receiver listens to a signal spread over the air that consists of multiple encoded sub-signals, one transmitted by each sender. Using the code of the sender, it is easy to decode only the signal of interest among the rest.
Advantages:
a) It naturally avoids collisions, so there are no extra overheads.
b) Fairness.
c) More efficient use of the channel.
d) CDMA can theoretically support an unlimited number of users, in contrast to TDMA (finite timeslots) and FDMA (finite sub-channels).
Disadvantages:
Near-far problem: Because all users transmit at the same frequency, the internal interference generated by the system is the most significant factor in determining system capacity and call quality. Since one transmission is noise to the others, the signal-to-noise ratio (SNR) can be low in many situations. Thus, it is not always easy to detect and receive a weak signal among stronger ones.

• Random Access Protocols (Contention based, on demand allocation)
These protocols are aware of the risk of collisions between transmitted data and are more suitable for mobile Ad-Hoc networks.

o Pure ALOHA – When a node has data to send, it transmits the data packet immediately. A packet is quite likely to collide with other packets transmitted at the same time, because all the nodes share the same medium and use the same frequency bands. When the sender detects the collision, it retransmits with probability p; otherwise it waits for a fixed amount of time.
Advantages:
a) Supports many simultaneous users.
b) Ease of management.
c) Speed in initial communication.
Disadvantages:

a) It has a maximum throughput of 18.4%, due to frequent collisions when many nodes try to “talk” at the same time.
b) There is no sensing mechanism to inform nodes about the channel condition.

o Slotted ALOHA – An improved version of Pure ALOHA. Nodes can start transmitting only at the beginning of a timeslot. A slot is “wasted” if two or more stations attempt to send a packet at its beginning; otherwise (only one station started transmitting at its beginning) the transmission proceeds uninterrupted for the duration of the slot. In terms of performance, it improves the maximum throughput from 18.4% to 36.8%.

o Carrier Sense Multiple Access (CSMA) – In ALOHA (pure and slotted) a station never takes the other nodes’ actions into account but acts selfishly. It neither listens to the channel for possible ongoing communications nor stops the transmission when a collision occurs. CSMA addresses the first of these shortcomings. The transmitter first tries to detect the presence of an encoded signal from another station. If a carrier is sensed, the node waits for the transmission in progress to finish before it senses the medium again. If the channel is idle, the node begins frame transmission.
Advantages:
a) Simplicity of mechanism.
b) Extremely scalable.
Disadvantages:
a) A node has no way to detect that a collision has occurred, so it cannot stop sending data through the wireless interface.
b) Collisions due to hidden/exposed terminals.

o CSMA/CD – An extension of the simple CSMA scheme described above. A collision may occur while a station is transmitting. If that node does not stop the communication process, the entire frame will be wasted, and the same will happen to all the other active transmitters. Collision Detection (CD) gives nodes the ability to stop the transmission in time and thus bound the amount of bandwidth wasted during a collision. The only drawback of CSMA/CD is that it has extra hardware requirements (sending and receiving at the same time is not simple when the communication is half-duplex), which leads to more expensive transceivers.
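The maximum throughput figures quoted for Pure and Slotted ALOHA (18.4% and 36.8%) can be reproduced with a short sketch. It relies on the classic textbook analysis, which is not derived in this text and is therefore stated here as an assumption: with Poisson offered load G (average transmission attempts per frame time), the throughput is G·e^(-2G) for Pure ALOHA and G·e^(-G) for Slotted ALOHA.

```python
import math


def pure_aloha_throughput(offered_load):
    """Successful frames per frame time, assuming Poisson offered load G:
    a frame survives only if no other frame starts within two frame times."""
    return offered_load * math.exp(-2.0 * offered_load)


def slotted_aloha_throughput(offered_load):
    """Same model, but the vulnerable period shrinks to a single slot."""
    return offered_load * math.exp(-offered_load)


if __name__ == "__main__":
    # Under this model the maxima occur at G = 0.5 (pure) and G = 1.0 (slotted).
    print("pure ALOHA peak   :", pure_aloha_throughput(0.5))
    print("slotted ALOHA peak:", slotted_aloha_throughput(1.0))
```

Under this model the two peaks are 1/(2e) and 1/e, which matches the 18.4% and 36.8% values given in the text.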

o CSMA/CA – Also an extension of the simple CSMA scheme. Each station first listens to the medium for ongoing transmissions. If the medium is busy, it chooses a random back-off time value d. Afterwards, a counter:
- counts down d while the channel is (sensed) idle and
- freezes when the channel stops being idle.
When the counter hits zero, the station retries sending the frame by starting the above process from the beginning. Although there is no way for a node to detect a collision (as in CSMA/CD), it is less likely that all waiting nodes will retry their transmissions simultaneously when the channel is sensed idle.

All the CSMA schemes face a common problem that is known in the wireless networks literature as the hidden and exposed station problem.

Hidden Terminal Problem
In Figure 5, node B is within the transmission range of C, but A is not. Correspondingly, B is within the transmission range of A but C is not. Let us suppose that C is currently transmitting data to B. If, in the middle of the C→B transmission, station A attempts to communicate with B, it will not sense that the medium is busy, because A is out of the range of C, and it will also send its own data to B. As a result, a collision will happen and B will not receive anything from either node.

Figure 5. Hidden Terminal Problem

Exposed Terminal Problem
Figure 6 illustrates the exposed terminal problem. Station C sends data to D. B is within the transmission range of C and A is within the transmission range of B. B wants to send data to A but does not sense the channel as idle because of the C→D communication. The C→D communication matters to B because B is a passive listener

of it and, as a consequence, B already receives packets from C (even though B is not the destination). The result of this scenario is that B decides to defer its own transmission to A. This deferral is unnecessary, because there is no possibility of a collision at A (A is not within the transmission range of C).

Figure 6. Exposed Terminal Problem

Solution:
The 802.11 protocol uses a nice technique to overcome the above problems. It uses two kinds of control frames, RTS (Request to Send) and CTS (Clear to Send). RTS is used by the transmitter to reserve the channel when it senses the channel to be idle. Afterwards, it waits until it receives the CTS frame from the receiver, indicating that the latter is ready to start “listening”. That is, if no CTS arrives within a given timeout t, the sender is indirectly informed that someone else is keeping the receiver busy.

2.2.3 MAC Properties for WSNs

When designing a protocol for a system, the focus should be placed on the design considerations and the operational requirements of that system. These special characteristics may include the physical ones (hardware, node size, supported technologies etc.), the practical ones (e.g. what kind of applications will run on top of the protocol stack) and finally the performance-related ones (scalability, delays etc.). In section 2.2.1 we reviewed the functions that a MAC layer must support and then we

discussed the issues that arise when communicating through the wireless medium. The next step in designing an effective MAC scheme for WSNs is the definition of the properties that it should have.

A general-purpose MAC-layer protocol takes into serious consideration the assurance of:

Fairness: Ensure that all nodes have equal opportunities to access the medium. It is also important for the node that acquired the medium to use it prudently.

Small Latency: It is a vital requirement for a network to respond quickly to the needs of an application. If a MAC scheme is fair enough but has very strict and complex rules, it may suffer from lengthy delays (the time elapsed between the moment a station requested to send a frame and the moment it finally managed to start sending the first bytes).

High throughput: Network resources (like bandwidth) wasted because of a rigid lower-level architecture affect the performance of the upper layers. The MAC layer sets the upper bound of the network throughput and thus it has to achieve high channel utilization.

WSNs have very tight constraints on power consumption, computational capabilities, transceiver strength and storage size. Very few of their characteristics are shared with those of a traditional wireless system such as GSM. It may sound surprising that a fair, low-latency, high-throughput MAC may not be ideal for a WSN, but this is the case. A MAC protocol designed for a sensor network is considered “well defined” if it:

Is energy efficient: Sensor nodes use batteries for power supply. These batteries are usually small and difficult (if not impossible) to recharge. The total network lifetime is therefore a major design factor. In section 2.2.4 we discuss why existing MAC schemes for wireless systems are not suitable for WSNs. Additionally, the major reasons for energy waste are defined.

Has good scalability: Scalability with respect to changes in network size, node density and topology is an additional property of great importance. WSNs are frequently deployed in areas where the topology changes rapidly. Nodes may fail at any time (due to hardware failures or an empty battery) or change location unexpectedly, and new nodes may join later. A good MAC protocol should easily adapt to such network changes.

Idle listening: A network interface may support several operation states, each one consuming a different amount of energy. Usually, the hierarchy is the following (P_xx represents the average consumed energy of state xx): P_sleep < P_idle < P_receive < P_transmit. A node spends P_idle in order to listen to an idle channel for possible traffic. This amount is not negligible (50-100% of the energy required for receiving) and is larger than P_sleep, as it requires more circuit elements of the transceiver to be powered on.

Overmitting: Happens when a node transmits a packet but the receiver is not ready to receive it (it may be in the sleep state). The packet is dropped.

It is helpful to review some known MAC schemes for the case of a WSN (we have already presented them in section 2.2.2).

TDMA/FDMA
Both of them are contention-free mechanisms, so collisions never occur. Every node listens in its own timeslot or on its own frequency, resulting in a zero probability of overhearing or overmitting. There is no need for control packets to be exchanged before two nodes communicate. The problem with TDMA/FDMA is the difficulty of managing inter-cluster communications. Moreover, when the topology changes (which happens very frequently), the medium has to be statically reassigned among a new set of nodes. This allocation is not an easy process and, combined with the need for intra-cluster synchronization, it makes TDMA/FDMA inappropriate for WSNs.

CDMA
In section 2.2.2 we briefly reviewed how CDMA works. We also briefly discussed the near-far problem. The solution to this problem is to adjust the transmission power of each node dynamically so that, for a given receiver, all signals reach the receiver with the same power. In cellular telephony systems CDMA is widely used because, in a cell, all mobile nodes interact only with the base station (BS). The BS controls the transmitting power of all the nodes within the cell. In an ad-hoc communication pattern like that of a WSN, this problem does not have such an easy solution. 
Every node can send and receive messages. The ideal transmission power for a particular receiver may induce a large

amount of noise on the channel for a neighbouring node/receiver that may not be related to our communication process. A CDMA-based MAC protocol for general-purpose Mobile Ad-Hoc networks has been proposed [17]. It uses channel-gain information obtained from overheard RTS and CTS packets over an out-of-band control channel. The implementation of this MAC protocol for WSNs is very complex, as it requires special transceiver capabilities and also needs a second control channel. Furthermore, it relies on overhearing, which, as we have seen above, is a primary cause of energy waste.

IEEE 802.11 [15]
Whenever IEEE 802.11 is used for WSNs, it operates in Ad-Hoc mode with the Distributed Coordination Function (DCF). Many control packets (RTS-CTS) are used in order to avoid the hidden/exposed terminal problems (see section 2.2.2). However, control packets and possible collisions are not the main causes of energy waste. Recent work [16] has shown that the energy consumption of the 802.11 MAC is very high when nodes are in idle mode. Idle listening consumes about 50-100% of the energy that is required to receive data. Several measurements have shown idle:receive:send ratios of 1:1.05:1.4. To sum up, an IEEE 802.11 MAC protocol is not suitable for a WSN in terms of energy consumption but, due to its great popularity, it is often chosen, with small modifications, for experiments and simulations.

2.2.5 Proposed MAC schemes for WSNs

In the literature on sensor networks, several MAC protocols/schemes have been proposed. Most of them use techniques like:
• turning off the radio when the channel is idle,
• having two separate channels (one for data and one for control),
• supporting several operating states with different energy consumption levels and
• performing power control (varying the transmission power).
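To see why turning off the radio is attractive, the idle:receive:send ratios of 1:1.05:1.4 quoted above can be plugged into a rough energy estimate. The sketch below is only illustrative: the relative sleep power `P_SLEEP`, the traffic fractions and the function name are hypothetical values chosen for the example and are not measurements from the cited work.

```python
# Relative power per state, using the idle:receive:send ratios quoted in the text.
P_IDLE, P_RECEIVE, P_SEND = 1.0, 1.05, 1.4
# Assumption: sleeping costs 1% of idle listening (hypothetical, for illustration only).
P_SLEEP = 0.01


def average_power(send_fraction, receive_fraction, listen_fraction):
    """Average relative power of a node.

    send_fraction / receive_fraction: share of total time spent sending / receiving.
    listen_fraction: share of the remaining time spent idle listening;
    the rest of the remaining time is spent asleep.
    """
    spare = 1.0 - send_fraction - receive_fraction
    if spare < 0 or not 0.0 <= listen_fraction <= 1.0:
        raise ValueError("fractions must describe a valid split of time")
    return (send_fraction * P_SEND
            + receive_fraction * P_RECEIVE
            + spare * (listen_fraction * P_IDLE + (1.0 - listen_fraction) * P_SLEEP))


if __name__ == "__main__":
    # A lightly loaded node: 1% of the time sending, 1% receiving.
    always_listening = average_power(0.01, 0.01, listen_fraction=1.0)
    duty_cycled = average_power(0.01, 0.01, listen_fraction=0.1)
    print("always listening:", always_listening)
    print("10% listen cycle:", duty_cycled)
```

For a lightly loaded node, almost all of the energy goes into idle listening, so under these assumptions sleeping most of the time reduces the average power several times over; the price is the extra latency discussed for S-MAC.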

It is not within the scope of this paper to describe them one by one. The reader is referred to [2] for a survey of current MAC protocols for wireless sensor networks. To give an idea of how a typical MAC protocol for WSNs operates, we briefly describe the S-MAC protocol.

S-MAC [1]
The designers of S-MAC set energy conservation and self-configuration as primary goals, while per-node fairness and latency were less important to them. It is known that lengthy packets lead to higher energy consumption, because in the case of a collision the wasted energy is higher than for a shorter packet (more bytes are transmitted in vain). Additionally, overhearing and control overhead due to large packets consume significant amounts of energy. This motivates the first technique of S-MAC, message passing, which achieves efficient transmission of a very long message: the message is divided into smaller fragments, which are then transmitted in a burst. Unfortunately, this modification is not fair, as nodes that have more data to send hold the medium longer from a per-hop MAC perspective. This drawback, though, is not crucial for WSNs compared to the energy savings. Another technique is the periodic listen and sleep scheme. In sleep mode, only a few circuit elements of the transceiver have to be powered on (the radio is off), leading to minimal power needs. This mainly saves the energy otherwise spent on idle listening, but latency increases because the sender has to wait for the receiver to “wake up”. Periodic listen-sleep is implemented using synchronization to form virtual clusters of nodes on the same sleep schedule, which minimizes the additional latency through coordinated operation. After extensive experiments, the authors of [1] found that S-MAC can reduce energy consumption by a factor of 2 to 6 (when messages are sent every 1-10 s) compared to IEEE 802.11.
What is more, S-MAC supports parameterized tradeoffs between energy and latency.

2.3 Network Layer
2.3.1 Routing challenges in WSNs
At first look, wireless sensor networks seem to have a lot in common with a Mobile Ad-Hoc Network (MANET). Nodes are mobile and they can change

wild animals are monitored inside a park of several thousand square meters, the placement of nodes is totally different from the case of a factory that supervises items in a warehouse.

• Particular traffic patterns
A typical communication scenario is as follows: a physical phenomenon occurs near a field of sensors. A large number of sensor events are triggered, resulting in the mass production of messages (containing values, IDs etc.). These messages have to be routed in a multi-hop fashion to a single point, the sink. The routing pattern looks like a reversed multicast tree, where there are multiple senders and the destination of all packets is a single node (many-to-one). Figure 7 demonstrates this pattern. This, however, does not prevent data from flowing in other forms (e.g., multicast or peer-to-peer). Additionally, since the data collected by multiple sensors are based on common phenomena, redundant data will very likely be propagated inside the network.

Figure 7. Reverse multicast tree (nodes A-G and the sink)

• Robustness to network dynamics
The nodes inside the deployment area may change location (mobile nodes). This leads to frequent changes in the “neighborhood” of a sensor. Moreover, several nodes may run out of battery or suffer hardware failures and die. The network has to maintain its connectivity no matter how many or how frequent the topology changes are.

• Scalability (using localized and distributed algorithms)

A routing algorithm must perform well regardless of network size or density. It is widely known that localized algorithms (which need only local information) and distributed algorithms scale very well.

• Energy efficiency
Since sensor nodes are battery powered, it is a major concern that the routing protocol be as energy efficient as possible. The network may have to operate for months without replacement or recharging of the power supplies. MANETs, in contrast, usually have this option, and therefore many end-to-end routing protocols proposed for ad-hoc networks are not applicable to WSNs. In addition, redundant data have to be eliminated inside the network, and load balancing between the nodes is essential, since some nodes (especially the ones placed close to the sink) tend to consume more energy (they perform too many transmissions and receptions).

To minimize energy consumption, several routing techniques have been proposed. They employ some well-known routing tactics as well as tactics specific to WSNs, e.g., data aggregation and in-network processing, clustering, different node role assignment, and data-centric methods. Almost all routing protocols can be classified according to the network structure as flat, hierarchical, or location-based. A taxonomy of the current routing schemes follows:

Data Centric
All nodes have the same role and data is named in an attribute-value fashion. Routing is performed according to the data contents and not by using predefined shortest paths (Directed Diffusion, SPIN-1 and SPIN-2).

Clustering / Hierarchical
Nodes are grouped into clusters so that cluster heads can perform some aggregation and reduction of data in order to save energy (LEACH, TTDD, TEEN).

Geographic / Location based
These schemes use position information to relay the data to the desired regions rather

than the whole network (GAF, GEAR).

Presenting all the routing protocols is beyond the scope of this paper, as it focuses on the networking issues of wireless sensor networks that are related (directly or not) to in-network processing and especially to data aggregation. For a survey of the current routing schemes the reader is referred to [18].

2.3.2 Data-centric Routing
Many end-to-end routing schemes have been proposed in the literature for mobile ad-hoc networks, but they are not appropriate under the requirements discussed in the previous section (2.3.1). It is not possible to build a global addressing scheme for the deployment of a huge number of sensor nodes. Therefore, classical IP-based protocols cannot be applied to sensor networks because of the great overhead that a binding service induces. Most of the existing network protocols for MANETs (like DSR, AODV etc.) assume a global identification of nodes, so they are not applicable to WSNs. Moreover, getting the data is sometimes more important than knowing the IDs of the nodes that sent it. Data is usually transmitted from every sensor node within the deployment region with significant redundancy. Since this is very inefficient in terms of energy consumption, routing protocols that are able to select a set of sensor nodes and use data aggregation while relaying data have been considered. This consideration has led to data-centric routing, which differs from traditional address-based routing, where routes are created between addressable nodes managed in the network layer of the communication stack. In data-centric routing, the sink sends queries to certain regions and waits for data from the sensors located in the selected regions. Since data is requested through queries, attribute-based naming is necessary to specify the properties of data.
For example, if the query is something like [temperature > 60F], then only the sensor nodes that sense a temperature above 60F need to respond and report their readings. Using data-centric routing it is possible:
(a) to combine the data on their way back to the sink,
(b) to eliminate duplicates,
(c) to perform smart caching inside the network,
(d) to compute several aggregate values inside the network and finally
(e) to save energy by reducing the number of routed messages (fewer transmissions mean less energy consumed by the transceiver).

Therefore, there are two routing models that a WSN can use:

Address-centric Protocol (AC): Each source independently sends data along the shortest path to the sink, based on the route that the queries took (“end-to-end routing”).

Data-centric Protocol (DC): The sources send data to the sink, but routing nodes look at the content of the data and perform some form of aggregation / consolidation function on the data originating at multiple sources.

The main reason for using data-centric routing schemes is the reduction of consumed energy, as already stated above. To get an idea of how this goal can be achieved, we consider a simple scenario (Figure 8). A heavy truck passes near a terrain where small wireless sensor nodes are deployed. The nodes are able to perceive abnormal sound levels in the environment. Let us suppose that the only nodes placed near the passing truck are Node A and Node B. Nodes C and D are within the transmission range of A, and only Node D is reachable by B. Both nodes C and D can transmit directly to the sink.

Case 1: Address-centric routing
Each source (A, B) sends its own information separately to the sink. The shortest paths are used, so Node A routes packet_A to the sink through C and Node B routes packet_B to the sink through D. There is no way for the network to “know” that packet_A and packet_B include identical values for the measured noise. Therefore, the intermediate nodes (C, D) blindly forward everything they receive to the next hop. The data contents of packets are not accessible to the network layer, as the only information provided is the address of the sender and the address of the receiver.
Consider the same scenario with 100 or 1000 small sensors near a volcano and imagine how many transmissions of identical (duplicate) packets will be performed. As a result, a large amount of energy will be dissipated.

Case 2: Data-centric routing

In the data-centric approach we assume that routing is not optimal: Node A routes packet_A to the sink through a longer path (than the one used in address-centric routing), with Node D being the last hop before the sink. The data is named (in attribute-value form) and the network layer protocol can access this information. It is now possible to perform several actions depending on the content of the data packets. The most reasonable action that a routing protocol can perform is duplicate suppression. Thus, Node D checks its cache, realizes that packet_A (which arrived later than packet_B) is identical to packet_B, and never transmits it to the sink.

Figure 8. Address-Centric vs Data-Centric Routing (panels: Address Centric, Data Centric; nodes A, B, C, D and the sink; label: Duplicate Suppression)

2.3.3 Directed Diffusion [3]
In-network processing must be supported by the network layer mechanism that is used. Data-centric routing protocols are ideal for this purpose. In this section we present one of the most common routing protocols of this category, Directed Diffusion.

Directed diffusion is a data-centric (DC) and application-aware paradigm in the sense that all data generated by sensor nodes is named by attribute-value pairs. The main idea of the DC paradigm is to combine the data coming from different sources (in-network aggregation) by eliminating redundancy and minimizing the number of transmissions, thus saving network energy and prolonging its lifetime. Unlike traditional end-to-end

routing, DC routing finds routes from multiple sources to a single destination that allow in-network consolidation of redundant data.

In directed diffusion, sensors measure events and create gradients of information in their respective neighborhoods. The base station (BS) requests data by broadcasting interests. An interest describes a task to be done by the network. The interest diffuses through the network hop by hop, and is broadcast by each node to its neighbors. As the interest propagates throughout the network, gradients are set up to draw data satisfying the query towards the requesting node. Each sensor that receives the interest sets up a gradient toward the sensor nodes from which it received the interest. This process continues until gradients are set up from the sources back to the BS. More generally, a gradient specifies an attribute value and a direction. The strength of the gradient may differ towards different neighbors, resulting in different amounts of information flow. At this stage loops are not checked; they are removed at a later stage. When interests fit gradients, paths of information flow are formed from multiple paths, and then the best paths are reinforced according to a local rule so as to prevent further flooding. In order to reduce communication costs, data is aggregated on the way. The goal is to find a good aggregation tree that gets the data from the source nodes to the BS. The BS periodically refreshes and re-sends the interest when it starts to receive data from the source(s). This is necessary because interests are not reliably transmitted throughout the network.

All sensor nodes in a directed diffusion-based network are application-aware, which enables diffusion to achieve energy savings by selecting empirically good paths and by caching and processing data in the network.
Caching can increase the efficiency, robustness and scalability of coordination between sensor nodes, which is the essence of the data diffusion paradigm. Another use of directed diffusion is to spontaneously propagate an important event to some sections of the sensor network. This type of information retrieval is well suited only to persistent queries, where requesting nodes expect data that satisfy a query for some duration of time. This makes it unsuitable for one-time queries, as it is not worth setting up gradients for queries that use the path only once.
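The interest flooding and gradient set-up described in this section can be sketched as follows. This is a simplified illustration and not the Directed Diffusion implementation of [3]: the topology is the one of the Figure 8 scenario, every node rebroadcasts the interest exactly once, and path reinforcement is approximated by keeping the neighbor from which the interest was first heard. The names flood_interest, data_path and NEIGHBORS are hypothetical.

```python
from collections import deque

# Topology of the Figure 8 scenario: A reaches C and D, B reaches only D,
# and both C and D reach the sink.
NEIGHBORS = {
    "sink": ["C", "D"],
    "C": ["sink", "A"],
    "D": ["sink", "A", "B"],
    "A": ["C", "D"],
    "B": ["D"],
}


def flood_interest(neighbors, sink):
    """Flood an interest hop by hop from the sink.

    Returns (gradients, reinforced):
      gradients[n]  - every neighbor from which n heard the interest,
      reinforced[n] - the neighbor from which n heard it first (a stand-in
                      for reinforcement; this is an assumption of the sketch).
    """
    gradients = {node: set() for node in neighbors}
    reinforced = {}
    queue = deque([sink])
    while queue:
        sender = queue.popleft()
        for receiver in neighbors[sender]:
            if receiver == sink:
                continue  # the sink originated the interest
            gradients[receiver].add(sender)
            if receiver not in reinforced:
                reinforced[receiver] = sender
                queue.append(receiver)  # rebroadcast once
    return gradients, reinforced


def data_path(reinforced, source, sink):
    """Follow reinforced gradients from a source back to the sink."""
    path = [source]
    while path[-1] != sink:
        path.append(reinforced[path[-1]])
    return path


gradients, reinforced = flood_interest(NEIGHBORS, "sink")
print(gradients["A"])                       # A holds gradients toward C and D
print(data_path(reinforced, "A", "sink"))   # ['A', 'C', 'sink']
print(data_path(reinforced, "B", "sink"))   # ['B', 'D', 'sink']
```

Note that after flooding a node may hold gradients toward several neighbors (here Node A toward both C and D), which is why a later reinforcement step is needed to select the paths actually used for data.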

3 Data Aggregation in WSNs
3.1 General
So far, we have discussed the networking issues of a WSN and have concluded that, in order to minimize the energy dissipated during its operation, the transceivers have to be used very sparingly. Supposing that the lower layers (physical and MAC) are energy efficient, it is up to the network and application layer protocols to follow a conservative strategy regarding the number of transmitted/received messages. In section 2.3 we reviewed a communication paradigm that is data-centric and suppresses duplicate sensor readings inside the network. Directed Diffusion performs the simplest form of what we call data aggregation. However, most of the time the user asks to retrieve information from a group of sensors placed in a specific area. An aggregate is a value or set of values that provides information about a sensor group and can be computed from the individual sensor readings of each element/node of that group. The computation is performed by an aggregate function that takes the readings of the field as input and outputs the respective aggregate. Examples of aggregate functions are SUM, MAX, MIN, AVERAGE etc.

The typical communication pattern in WSNs assumes that multiple sensors/sources send data towards a single sink (receiver) that is reachable via multihop routes. The idea behind data aggregation techniques is to combine the received values at each intermediate node on their way to the sink. Thus, the corresponding packet flow resembles a reverse-multicast structure, which is called the data aggregation tree (a spanning tree of the sensors that perform sensing tasks for a given query).

To get an idea of how in-network data aggregation achieves significant energy savings, we consider the general approaches discussed at the end of the introduction.

Direct delivery (Figure 9a):
Each sensor value must be routed to the sink.
If a sensor resides at depth n of the routing tree, this requires transmitting n-1 messages (1 transmission by itself and the rest by the intermediate nodes of the path). After receiving all messages, the sink computes the aggregated value. In Figure 9a each node is labelled with the total number

of hops/messages needed to reach the sink. The total number of transmissions needed for the sink to collect all the values is 1 + 2 + 3 + 3 + 3 + 4 = 16.

Distributed in-network approach (Figure 9b):
All nodes send locally sensed values to their neighbour/parent. The value received by the sink is the ready-computed aggregate. However, each node has to wait for its children before computing the aggregate and forwarding it (so the response delay grows). Moreover, the amount of transmitted data depends on the type of the aggregate function. The total number of transmissions needed for the sink to collect the final aggregated value is 1 + 1 + 1 + 1 + 1 + 1 = 6.

From the above example it is clear that the amount of transferred data is reduced (the distributed approach performs 10 fewer transmissions than direct delivery). This difference translates into a reduction of the energy consumption when using the distributed approach.

Figure 9. a) Direct Delivery, b) Distributed in-network aggregation

Benefits from Distributed Data Aggregation
1. Reduction in the total number of transmissions performed inside the network.
Sensor data are aggregated because they are fused at each parent node with the parent’s local values and the data received from other children. Since the user is not interested in individual values, there is no loss in the quality of the result returned.

2. Fewer packet collisions.
Packet collisions occur more frequently when the network is heavily loaded. Fewer transferred packets translate into fewer packet collisions.
3. Fewer redundant readings.
In the case of direct delivery, where all the individual readings are delivered separately to the host node, it is common for a node to send its values to multiple parents. This results in the sink receiving the same packet multiple times. In data aggregation, the filters drop the redundant packets (caching is used).
4. Increase in the accuracy of results.
If a sensor node is temporarily down, its parent can estimate a virtual value based on the node’s previous readings. Thus the total aggregate value is not significantly affected. Of course, this benefit holds only in slowly changing environments.

3.2 Query Processing
A user can send queries to the sensor network in order to retrieve information about its state. The query is usually injected into the network from specific nodes (e.g. the sink or a gateway). There are several proposed approaches for the syntax of a query, its lifetime, its scope, its propagation and finally its parametrization.

In Directed Diffusion [3] the queries are formed as “interests”, which are expressions with multiple attribute-value pairs (Figure 10). Predefined attribute categories and subcategories and application-specific data representation characterize interests. Usually data are also named in a similar way, so the application filters are able to perform data aggregation.

Interest message example:
type = wheeled vehicle
interval = 20 ms // send events every 20 ms
duration = 10 s // for the next 10 s
rect = [-100; 100; 200; 400] // from sensors within rectangle
Figure 10
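As a rough illustration of attribute-based naming, the interest of Figure 10 can be represented as a set of attribute-value pairs and matched against locally detected events. The sketch below is not the Directed Diffusion API; the dictionary keys, the function names and in particular the reading of rect as (x_min, x_max, y_min, y_max) are assumptions, since the source does not define the order of the four coordinates.

```python
# The interest of Figure 10 as attribute-value pairs.
# Assumption: rect is read as (x_min, x_max, y_min, y_max).
INTEREST = {
    "type": "wheeled vehicle",
    "interval_ms": 20,   # send events every 20 ms
    "duration_s": 10,    # for the next 10 s
    "rect": (-100, 100, 200, 400),
}


def in_rect(x, y, rect):
    """True if point (x, y) lies inside the rectangle (bounds inclusive)."""
    x_min, x_max, y_min, y_max = rect
    return x_min <= x <= x_max and y_min <= y <= y_max


def matches(event, interest):
    """True if a locally detected event satisfies the interest: same type
    and detected by a sensor located inside the requested rectangle."""
    return (event["type"] == interest["type"]
            and in_rect(event["x"], event["y"], interest["rect"]))


# Hypothetical events detected by three different sensors.
inside = {"type": "wheeled vehicle", "x": 50, "y": 300}
outside = {"type": "wheeled vehicle", "x": 150, "y": 300}
other_type = {"type": "animal", "x": 50, "y": 300}

print(matches(inside, INTEREST))      # True
print(matches(outside, INTEREST))     # False
print(matches(other_type, INTEREST))  # False
```

Only sensors whose events match need to report, and because data are named with the same attributes, intermediate nodes can apply the same kind of test when deciding what to cache or aggregate.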

A different method is used in TAG [10], where a user query is expressed in an SQL-like language (Figure 11). The query is always performed over a single table called “sensors”.

SELECT AVG(volume), room FROM sensors
WHERE floor = 6
GROUP BY room
HAVING AVG(volume) > threshold
EPOCH DURATION 30s
Figure 11

The duration of a query's execution is usually defined within its body (the duration attribute in Figure 10 and the EPOCH clause in Figure 11).

A query may be executed periodically many times (periodic, long-running queries) or only once, returning just a snapshot of the current state of the network (snapshot queries).

3.3 Aggregation operators
The authors of TAG [10] propose a taxonomy of the aggregation operators based on several criteria:
1. Duplicate sensitivity.
This property specifies whether an aggregate function returns the same result when the dataset contains duplicate values. Examples of duplicate-sensitive aggregates are MEDIAN, AVERAGE, and COUNT. Examples of duplicate-insensitive aggregates include MIN, MAX, and COUNT DISTINCT.
2. Exemplary/Summary.
Exemplary aggregates always return a representative value present in the dataset, while summary aggregates perform some calculation over the entire dataset and return the calculated value. Summary values (such as AVERAGE and COUNT) are more easily estimated even in a network with losses, where not all data packets are received.

Exemplary aggregates, on the other hand, may be highly inaccurate if even a few messages are lost. Such aggregates include MIN, MAX, and MEDIAN.
3. Monotonic aggregates.
Aggregates that allow early testing of predicates in the network are monotonic. For example, assume the user requests the MAX temperature reading in the network. As source nodes report their values toward the host node, other nodes may listen and report their own values only if they are greater than the current MAX. This saves messages across the network without affecting the result.
4. Partial state requirements.
The amount of partial state information required differs among aggregate functions. Aggregates such as SUM and COUNT require partial state records that are the same size as the final aggregate. The AVERAGE function requires a partial state record containing two values (both the SUM and the COUNT). Other aggregates such as MEDIAN and HISTOGRAM require that the entire dataset be returned to the host node unless some type of compression or estimation is used (see [10]).

3.4 A Taxonomy of Current Aggregation Approaches
3.4.1 Tree-Based Techniques
Low level Naming – Filter Based [4]
In order to illustrate how in-network processing can reduce data traffic to conserve energy, an example of filter-driven data aggregation using Directed Diffusion is presented below. The anticipated sensor application queries a number of sensors so as to take some action when one or more of the sensors is activated. Take as an example a surveillance system that notifies a biologist if an animal enters a specific region. To ensure robust coverage, the coverage areas of the deployed sensors will overlap and, as a result, one event is likely to trigger multiple sensors of the network. Every sensor will report the detection to the user, whereas communication and

energy costs could be reduced by aggregating this data as it returns to the user. This aggregation could produce a binary value (a detection exists), an area (a detection exists in quadrant 2), or an application-specific result (seismic and infrared sensors indicate an 80% chance of detection).

Although the details of aggregation can be application-specific, a common problem is the design of mechanisms for establishing data dissemination paths to the sensors within the region, as well as for aggregating the responses. Let us now consider how this kind of data fusion may be implemented in a traditional network where low-level node names are topologically assigned. Firstly, a binding service must exist to list the node identifiers of the sensors within a given region, in order to determine which sensors are present in that region. Once these sensors are tasked, an election algorithm must dynamically choose one or more nodes to aggregate the data and return the result to the querier.

Directed Diffusion [3], on the other hand, addresses this problem by using opportunistic data aggregation. The selection as well as the tasking of sensors is accomplished by naming nodes using geographic attributes. As data is sent from the sensors to the querier, the intermediate sensors on the return path identify and cache relevant data with the aid of application-specific filters. They can then suppress duplicate data by simply not propagating it, or they may slightly delay and aggregate data from multiple sources.

Opportunistic aggregation strategies benefit from filters in various ways, as filters provide a natural approach for inserting application-specific code into the network. The naming and matching of attributes allow these filters to stay inactive until they are triggered by relevant data.
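A minimal sketch of such a filter is given below. It is not the filter API of [4]: the class name, the message layout and the rule that two messages are duplicates when all their attribute-value pairs except the reporting node's identifier are equal are assumptions made for illustration only.

```python
class DuplicateSuppressionFilter:
    """Filter run on an intermediate node: it caches relevant data and does
    not propagate a message whose content has already been seen."""

    def __init__(self, attribute_type):
        self.attribute_type = attribute_type  # data type that triggers the filter
        self.cache = set()

    def process(self, message):
        """Return the message if it should be forwarded, else None."""
        if message.get("type") != self.attribute_type:
            return message  # filter stays inactive for unrelated data
        # Assumption: the reporting node ("source") is not part of the content.
        content = frozenset(
            (key, value) for key, value in message.items() if key != "source"
        )
        if content in self.cache:
            return None  # duplicate: suppress by simply not propagating it
        self.cache.add(content)
        return message


# Two overlapping sensors report the same detection through the same node.
node_filter = DuplicateSuppressionFilter("detection")
report_b = {"type": "detection", "quadrant": 2, "source": "B"}
report_a = {"type": "detection", "quadrant": 2, "source": "A"}
unrelated = {"type": "temperature", "value": 20, "source": "A"}

print(node_filter.process(report_b))   # forwarded
print(node_filter.process(report_a))   # None (suppressed)
print(node_filter.process(unrelated))  # forwarded, filter not triggered
```

A filter that delays and merges reports from multiple sources instead of dropping them would follow the same structure, with a timer and a merge function in place of the cache lookup.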
The significance of a common attribute set is that filters incur no network costs to interact with directory or mapping services.

In [4] one can find more complex examples of in-network aggregation using filters. Another important point is that nested queries (where one sensor cues another) can be used for triggering and can reduce overall energy consumption significantly. For example, the entrance of a person into a room is often correlated with changes in light or motion. Multi-modal sensor networks can exploit these correlations by triggering a secondary sensor based on the status of another, in other words by nesting one query inside another. The overall energy consumption as well as the network traffic can be reduced by reducing the duty cycle of some sensors. For example, energy consumption is reduced if the secondary sensor

consumes more energy than the initial sensor (such as an accelerometer triggering a GPS receiver), whereas network traffic is reduced because, for example, a triggered imager generates much less traffic than a constant video stream. In other words, in-network processing might choose the best application of a sparse resource (for example, a motion sensor triggering a steerable camera).

COUGAR [5], [6]
In order to enable declarative querying of sensor networks, [5] and [6] propose a query layer consisting of a query proxy on every sensor node. In the architecture of the sensor node, the query proxy lies between the network layer and the application layer, and it provides higher-level services using queries that can be injected into the network from a specified gateway node.

COUGAR enables the user to parameterize the queries. A complex query may consist not only of a large number of parameters and operators but also of various user requirements on the query answers, such as a maximum permissible latency and the accuracy of the query result.

The query proxy is also responsible for data aggregation. Part of the computation can be moved from a location outside the network and pushed into the sensor network, aggregating records or eliminating irrelevant records. Compared to traditional centralized data extraction and analysis, in-network processing can reduce energy consumption and improve sensor network lifetime significantly. This is why one of the main roles of the query proxy when processing user queries is to perform in-network processing.

Nevertheless, COUGAR has some drawbacks as well. First, the addition of a query layer on each sensor node may add overhead in terms of energy consumption and memory storage. Second, successful in-network data computation requires synchronization among nodes (not all data are received at the same time from incoming sources) before the data are sent to the leader node.
Third, the leader nodes should be dynamically maintained to prevent them from becoming hot-spots (failure prone).

As an example, suppose that we have a long-running query Q that monitors the average temperature of an office every t seconds. The query Q notifies the administrator of the network if the average temperature in the office is greater than a user-defined threshold,

by generating an output record. A query optimizer generates an efficient query plan for in-network processing of query Q, in order to vastly reduce resource usage and thus extend the lifetime of the sensor network. A query plan specifies both the data flow (between sensors) and an exact computation plan (at each sensor). The computation plan determines the leader of the specific query, a designated node where the computation of the average temperature will take place. The leader could be either a fixed sensor with more remaining power and energy, or a node selected randomly by some distributed leader election algorithm. Two computation plans are produced, one for the leader node and a second one for the remaining nodes in the query region. A query optimizer can also apply several techniques to improve the performance of the system. For example, it can merge a new query with existing similar queries. It is important to mention that, in order to generate a good plan for a user query, the optimizer requires metadata about the status of the sensor network to evaluate the costs and benefits (latency and accuracy) of different plans.

Data aggregation in COUGAR involves two important issues:
a) From a computational point of view, the aggregation must take place at a "leader" node [leader election problem + dynamically maintaining a leader + leader with a physically advantageous location], unless the final computation of the aggregate is delegated to a gateway node or happens outside the network.
b) Data records have to be delivered from the source sensor nodes to the designated leader, either by sending data records directly to the leader along multi-hop routes, so that all the computation takes place at the leader, or alternatively by pushing partial computation from the leader to internal nodes along the path in order to reduce data size on-the-fly.
Synchronization between sensor nodes along the communication path is very important, since a node has to "wait" to receive the results that are to be aggregated.

TAG [10]
In the TAG system, users connect to the sensor network through a workstation or a base station that is directly connected to a sensor and plays the role of the root node. Aggregation queries over the sensor data are formulated in a simple SQL-like language. The in-network query evaluation consists of two phases, the (smart) distribution phase and the collection phase. During the distribution phase, the query is flooded in the network and organizes the nodes into an Aggregation

Tree. To be more specific, as the query is distributed across the network, a spanning tree is formed from the sensors, in order to return data back to the root node. During the data collection (sensing) phase, each leaf node produces a single tuple and forwards it to its parent. The non-leaf nodes receive the tuples of their children and combine these values. Afterwards, they submit the new partial results to their own parents. Finally, the total result arrives at the root after h steps, where h is the height of the aggregation tree. If there are no failures, this technique works extremely well for decomposable aggregates, namely distributive and algebraic aggregates such as MIN, MAX, COUNT and AVG.

When a query involves an epoch, requiring readings to be collected periodically, TAG uses the periodic per-hop adjusted aggregation approach. It subdivides the epoch into slots. The length of each slot is equal to the epoch length divided by n, where n is the maximum number of hops separating the nodes that generate data from the sink. With per-hop adjusted aggregation, slots are assigned to nodes in decreasing order (i.e. n, n-1, n-2, …) as the query propagates through the network. Each node transmits in its slot; thus, the nodes that transmit first are the outermost nodes, whereas the nodes that transmit last are those closest to the sink. As in any time-slotted mechanism, clock synchronization among nodes is required so that nodes transmit in their designated slots.

Nevertheless, the tree-based approach of TAG breaks down when failures are introduced into the system. Especially in sensor networks, both node and link failures are very common. Node failures are expected to be relatively frequent, since the sensors are meant to be small, cheap and mass-produced, and they will be placed in a variety of uncontrolled environments.
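The failure-free collection phase described above can be sketched as follows for the AVERAGE aggregate, using the (SUM, COUNT) partial state record of section 3.3. This is an illustrative sketch and not TAG's implementation: the tree is a hypothetical topology whose node depths (1, 2, 3, 3, 3, 4) match the labels of the Figure 9 example, and the readings and function names are invented for the example.

```python
# Hypothetical aggregation tree: parent -> children. Node depths are
# n1=1, n2=2, n3=n4=n5=3, n6=4, matching the labels in the Figure 9 example.
CHILDREN = {
    "sink": ["n1"],
    "n1": ["n2"],
    "n2": ["n3", "n4", "n5"],
    "n3": ["n6"],
}
# Invented sensor readings (the sink itself does not sense).
READINGS = {"n1": 20, "n2": 22, "n3": 21, "n4": 23, "n5": 24, "n6": 22}


def collect_average(children, readings, root):
    """In-network AVERAGE: every node merges its children's (sum, count)
    partial state records with its own reading and sends one record upward.
    Returns (average, number_of_transmissions)."""
    messages = 0

    def partial_state(node):
        nonlocal messages
        total, count = (readings[node], 1) if node in readings else (0.0, 0)
        for child in children.get(node, []):
            child_total, child_count = partial_state(child)
            messages += 1  # one transmission from child to parent
            total += child_total
            count += child_count
        return total, count

    total, count = partial_state(root)
    return total / count, messages


def direct_delivery_messages(children, root):
    """Transmissions needed when every reading is relayed individually:
    a node at depth d costs d transmissions, so the total is the sum of depths."""
    def walk(node, depth):
        return depth + sum(walk(c, depth + 1) for c in children.get(node, []))
    return walk(root, 0)


average, aggregated = collect_average(CHILDREN, READINGS, "sink")
direct = direct_delivery_messages(CHILDREN, "sink")
print(average, aggregated, direct)  # 22.0 6 16
```

Each node transmits a single fixed-size record regardless of the size of its subtree, which is why the in-network approach needs 6 transmissions here against 16 for direct delivery. It is also why the loss of one record discards the contribution of a whole subtree.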
Link failures (and packet losses) are also expected to occur very often due to environmental interference, packet collisions, and low signal-to-noise ratios. Furthermore, if a node fails or its message does not reach its parent, the values associated with the entire subtree are lost. If the failure occurs close to the root node, the effect on the resulting aggregate can be significant.

In order to improve the performance of the TAG service, several optimizations have been proposed with significant results. To conserve energy, sensor nodes sleep as much as possible during each step in which the processor and radio are idle. When a timer expires or an external event occurs, the device wakes up and starts the processing and communication phases. At this point, as mentioned above, it receives the messages from its children and then submits the new value(s) to its parent. If no more

processing is needed for this step, the node enters again the sleeping mode [11]. Whilethis approach is satisfactory suitable for ideal network conditions, Madden et al. [10]proposed some methods with the goal of improving the performance of their system. Onesolution is to cache previous values and reuse them if newer ones are unavailable but ofcourse, the previous values may reflect losses at lower levels of the tree.Pipelined Aggregation [7]The authors of [7] proposed a fully-pipelined approach for aggregation. Also, as thisapproach is an ancestor of TAG, it has many similarities with it. Both of them assumethat a sensor network is a distributed database. Furthermore, it implements a generalpurpose,SQL-style interface that can execute queries over any type of sensor data whileproviding the abilities for significant optimization.Each sensor possesses a unique id that is used just for local interactions and not inorder to route the sensed data back to the sink as expected. It does not use a data-centricrouting mechanism but it forms a routing tree upon query distribution, just as the casewas in TAG. Again, one sensor plays the role of the root point, upon which, aggregateddata will converge and also it is the interface of the querying user for the rest of thenetwork.The computation of an aggregate consists of two phases: the propagation phase, inwhich aggregate queries are pushed down into sensor networks, and the aggregationphase, in which the aggregate values are propagated up from children to parents. Whiledata flow back to the top of the tree (to sink/root), all the sensors that have any childrenmust wait for their responses before computing the local aggregate and then they canforward it to its own parents. There is a timeout t for waiting to hear a propagate messagefrom a child to bound in-network delay. Choosing a small value for t may result inmissed reports from children. 
Note also that the proper value of t varies with the depth of the routing tree.

This simple approach would work well if WSNs were not inherently unreliable. Unreliability affects the accuracy of the returned aggregate values. Pipelined aggregation addresses this problem. The propagation of aggregates is done as described above, except that time is divided into intervals of duration i. During each interval:

a) Sensors that have heard the aggregate request transmit to their parent a value that fuses:
• the partial aggregate they have already received
• the aggregates received from their children
• their local reading

b) After the first interval has elapsed, the root will have received messages containing aggregates from sensors that are only 1 hop away. After the second interval, the root will have received messages with aggregates from sensors that are 2 hops away, and so on.

In the pipelined approach described above, a new aggregate arrives every i seconds. In this manner, the user who injected the query quickly obtains a first rough estimate of the aggregate (a faster first response), and every i seconds receives a more accurate result. Consequently, pipelined aggregation provides users with a stream of aggregate values that changes as the sensor readings and the underlying network change.

The most significant disadvantage of this approach is that a number of additional messages are transmitted in order to extract the temporary aggregate values from all the sensors of the network in every interval. However, despite this overhead, simulation shows that the technique generally improves the robustness as well as the throughput of the network.

Several optimizations are proposed in [7]:

1) Snooping: This optimization takes advantage of the shared radio channel, through which every message is broadcast to all sensors within range. When a sensor misses an initial request to aggregate, it can start aggregating as soon as it overhears another sensor reporting an aggregate value to its own parent. Going one step further, a sensor does not need to explicitly tell its children to begin aggregation. It can simply report its value to its parent, as this value will also be heard by its children. The children will assume that they missed the start request and will therefore initiate aggregation locally. Snooping reduces the total number of messages required to compute the first full aggregate of the network, for a total saving of 23%.

2) Use of multiple parents: Due to the broadcast nature of radio, data redundancy is very common. While from one perspective this provides more reliability, for duplicate-sensitive aggregates such as COUNT a node whose value reaches several parents may be counted multiple times (MAX and MIN are not affected by duplicates). A possible solution to this problem is to send part of the aggregate to one parent and the rest to the other. The benefit is that the variance of the multiple-parent COUNT is much lower, although its expected value is the same. In other words, a single loss affects the computed value less.

3) Hypothesis testing: The above optimizations still require input from every node of the network in order to compute an aggregate. Sometimes, however, a sensor's value needs to be heard only if it will affect the value of the aggregate. With hypothesis testing, the node decides locally whether contributing its reading and the readings of its children will affect the value of the aggregate.

For MIN (or MAX), hypothesis testing can be done even with snooping or with a pipelined aggregate over the first k levels of the aggregation tree.

For COUNT/SUM/AVERAGE the user defines an error_bound. If the value of a leaf is within the defined error_bound, it remains silent and its parent keeps the old approximate answer; otherwise the child forwards its own value to its parent.

3.4.2 Multipath Techniques

Recently the WSN research community has turned its interest from tree-based aggregation schemes to multi-path graphs, where multiple edges can exist between two nodes.
This field is steadily gaining attention, because data sent from one sensor to another can easily be lost due to the high error rate of wireless communication.

The rationale behind multi-path graphs is as follows: instead of requiring each node to send its accumulated partial result to its single parent in an aggregation tree, the multi-path approach exploits the wireless broadcast medium by having each node broadcast its partial result to multiple neighbors. This approach sends the same minimal number of messages as the tree approach (i.e., one transmission per node), which makes it energy-efficient. It is also very robust to communication failures, because each reading is accounted for in many paths towards the base station, and all of them would have to fail for the reading to be lost. However, the multi-path approach has three drawbacks:

(1) for many aggregates, the known energy-efficient techniques provide only an approximate answer (with accuracy guarantees),
(2) for some aggregates, the message size is larger than with the tree approach, thereby consuming more energy, and
(3) there is a danger of double-counting the same values because of the multiple parents.

The protocol presented next uses a topology called rings. In this topology, nodes are divided into levels according to their hop count from the base station, and the multi-path aggregation is performed level by level towards the base station.

SYNOPSIS DIFFUSION [8]

In [8] a general framework is proposed for achieving significantly more accurate and reliable answers by combining energy-efficient multi-path routing schemes with techniques that avoid double-counting. Synopsis Diffusion uses a rings topology to perform in-network aggregation. The partial result at a node is represented as a synopsis, a small digest (e.g., histogram, bit-vector, sample, etc.) of the data.

The synopsis diffusion algorithm consists of two phases:
a) a distribution phase, in which the aggregate query is flooded through the network and an aggregation topology is constructed, and
b) an aggregation phase, in which the aggregate values are continually routed towards the querying node.

Distribution Phase

To construct a rings topology, the base station first transmits, and any node hearing this transmission is placed in ring 1. At each subsequent step, nodes in ring i transmit, and any node that hears one of these transmissions and is not already in a ring is placed in ring i + 1. As a result, the ring number defines the level of a node in the rings topology. Aggregation proceeds level by level, with level i + 1 nodes transmitting while level i nodes are listening. In contrast to trees, the rings topology exploits the wireless broadcast medium by having all level i nodes that hear a level i + 1 partial result incorporate that result into their own. This significantly increases robustness, because each reading is accounted for in many paths towards the base station, and all of them would have to fail for the reading to be missing from the query result. As with trees, nodes can monitor link quality and change levels as warranted. A key advantage of using a rings topology is that the communication error is typically very low, in stark contrast with trees. Moreover, the rings approach is as energy-efficient as the tree approach (within 1%). Nevertheless, because each partial result is accounted for in multiple other partial results, special techniques are required to avoid double-counting.

Aggregation Phase

The aggregate computation is defined by three functions on the synopses:
• Synopsis Generation: a synopsis generation function SG(arg) takes a sensor reading (including its metadata) and generates a synopsis representing that data.
• Synopsis Fusion: a synopsis fusion function SF(arg1, arg2) takes two synopses and generates a new synopsis.
• Synopsis Evaluation: a synopsis evaluation function SE(arg) translates a synopsis into the final answer.

The exact details of the functions SG(), SF() and SE() depend on the particular aggregate query to be answered.

During the aggregation phase, each node periodically uses SG() to convert sensor data into a local synopsis and SF() to merge two synopses into a new local synopsis. For example, whenever a node receives a synopsis from a neighbour, it may update its local synopsis by applying SF() to its current local synopsis and the received synopsis. Finally, the querying node uses SE() to translate its local synopsis into the final answer. The continuous query defines the desired period between successive answers, as well as the overall duration of the query.

Double-Counting Problem

As already mentioned, multipath techniques tend to double-count values because some nodes have multiple parents. Synopsis diffusion avoids double-counting through the use of order- and duplicate-insensitive (ODI) synopses that compactly summarize intermediate results during in-network aggregation.

Main drawback

The total number of messages exchanged between the sensor nodes during the aggregation phase is much higher than that of a simple tree-based approach. Therefore, the synopsis diffusion approach is oriented more towards robustness than towards energy efficiency.

3.4.3 Distributed Localized Algorithms

Using Distributed Estimation [19]

Boulis et al. [19] proposed a distributed localized algorithm that explores the energy/accuracy subspace for the periodic aggregation domain. They first separate in-network processing algorithms into two types.

Processing Type-I: Includes all the “snapshot aggregation” solutions, which calculate an aggregate by simply combining the values at multiple intermediate nodes until the final value reaches the sink (this type includes all the approaches reviewed so far).

Processing Type-II: Includes the approaches that take into account that the sensed values are only approximations of the real ones; rather than trying to calculate an aggregate exactly, they attempt to achieve a “good” estimate of it.

The solution proposed in [19] belongs to the second type and addresses the case of the MAX aggregation function.

Every node keeps an estimate of the global aggregated value in the form of a vector V. These estimates change dynamically over time as the nodes interact with each other by sending their own estimates to their neighborhood. The local view (V) of the global estimate can change in two ways:
a) after the reception of a global estimate from a neighbor (Module A, Figure 12), or
b) after a new value has been generated by the local sensors (Module B, Figure 12).

Analyzing the mechanism of local and global estimate fusion is beyond the scope of this paper; our interest is in how a node decides whether or not to transmit the fused value (V') to the others (Modules C and D, Figure 12). Each node keeps an estimation table with the most recent estimates (V1, V2, …, Vn) received from each of its neighbors. It combines V' with each of V1, V2, …, Vn separately to create the temporary (virtual) new neighbor estimates V1', V2', …, Vn'. Given a user-defined threshold parameter Th on the maximum difference, it does the following:
a) If |V1' - V1| > Th, it broadcasts V', as V' is likely to seriously affect the others' estimates.
b) If |V1' - V1| < Th, it does nothing, because V' will not seriously affect the others' estimates.

Figure 12. The modular structure of the algorithm
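The transmission decision of Modules C and D can be sketched as follows. This is a minimal illustration under stated assumptions: estimates are reduced to single numbers, the fusion step (not analyzed here) is replaced by a plain max() stand-in for the MAX aggregate, and the threshold test, which the text states for V1, is applied to every neighbor estimate in turn. The names fuse_max and should_broadcast are hypothetical and do not come from [19].

```python
from typing import Callable, Dict


def fuse_max(own: float, other: float) -> float:
    """Stand-in fusion rule for the MAX aggregate (an assumption of this sketch)."""
    return max(own, other)


def should_broadcast(
    v_new: float,
    neighbor_estimates: Dict[str, float],
    th: float,
    fuse: Callable[[float, float], float] = fuse_max,
) -> bool:
    """Decide whether the fused local estimate v_new (V') is worth sending.

    For each stored neighbor estimate V_i, compute the virtual estimate V_i'
    that the neighbor would hold after hearing V', and broadcast only if at
    least one of them would move by more than the threshold th.
    """
    for v_i in neighbor_estimates.values():
        v_i_virtual = fuse(v_new, v_i)
        if abs(v_i_virtual - v_i) > th:
            return True
    return False


# Hypothetical estimation table: last estimate heard from each neighbor.
table = {"a": 20.0, "b": 24.5}
send_low_th = should_broadcast(25.0, table, th=2.0)   # neighbor "a" would move by 5.0
send_high_th = should_broadcast(25.0, table, th=6.0)  # no neighbor would move by more than 6.0
```

In this example the same new estimate is broadcast under the low threshold and suppressed under the high one, which is exactly the behaviour that the threshold Th is meant to control.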

Benefits

1. It is a completely distributed and localized solution to the periodic aggregation problem and, unlike the solutions that rely on aggregation trees, it is not vulnerable to node failures, owing to the broadcast nature of the communication between the nodes.
2. Simulations showed that it achieves significant energy savings for some values of the threshold parameter (Th).

Drawbacks

It requires the exchange of large volumes of information between the nodes when the aggregate being computed is more complex than MAX/MIN (e.g. MEDIAN, HISTOGRAM etc.).

Gossip-Based Computation of aggregates [9]

In [9], a new theoretical method is proposed for the computation of aggregates (such as sums, averages and random samples) with gossip-style protocols. Why use a gossip-based protocol? The following characteristics make the case:
a) Gossip-based algorithms are known to scale well.
b) Each node communicates with one or a few nodes of its neighborhood; thus, only local information is needed.
c) Simplicity of design and operation. Gossip-style protocols usually do not require error recovery mechanisms, and often incur only moderate overhead compared to optimal deterministic protocols such as the construction of data dissemination trees.
d) They achieve high stability under stress and disruptions. In comparison, traditional techniques offer absolute guarantees, but are unstable or fail to make progress during periods of even modest disruption.

It is pointed out that the point-to-point Uniform Gossip protocol is not suitable for wireless sensor networks. Instead, an alternative distributed broadcast-based algorithm is proposed, namely flooding. The convergence of the flooding algorithm is analyzed using the mixing time of the random walk on the underlying graph. It is assumed that the underlying graph is ergodic and reversible [discrete events theory]; hence the algorithm may not converge on many natural topologies such as a grid. However, the algorithm runs very fast (logarithmic complexity class) on certain graphs. Under the assumption of a complete graph, the analysis shows that, with high probability, the values at all nodes converge exponentially fast to the true (global) average.

Advantages of the proposed protocol

As expected, the algorithm is very simple owing to the use of gossip-based communication. It converges quickly: after a few rounds/iterations, the estimate computed by the algorithm is a good approximation of the real value of the aggregate. What is more, it adjusts automatically when nodes join or leave.

Drawbacks

The number of rounds needed for the estimate of the aggregate (e.g. the average) to come close to the real value (converge) is critical information. Nodes, however, do not have a global view of the network (the algorithm is localized), so it is difficult to know a priori after how many rounds they can stop running the algorithm. An easy way of stopping the execution is to let the node that initiated the aggregate query send a stop message. The querying node first samples and compares the values from nodes at different locations. If the sampled values are all the same, or within some acceptable accuracy range, the querying node disseminates the stop message. This method incurs a delay overhead for the dissemination, and it highlights the lack of a purely distributed, local stop mechanism on each node.

3.5 Exposing Trade-Offs in data-aggregation schemes

Energy-Accuracy

Recalling the algorithm of Boulis et al. [19] from section 3.4.3, it is easy to see that this approach creates a system-level energy vs. accuracy knob. If an application can tolerate large estimation errors, it can set the threshold parameter “Th” to a high value. This leads to significant energy savings, because fewer global estimates are transmitted in total: a node is now less likely to decide to broadcast its new estimate. However, such a “Th” value reduces the accuracy of the algorithm, especially in environments where the sensed values change very fast. Conversely, selecting a low threshold “Th” leads to high accuracy but, unfortunately, also to high energy consumption.

Energy-Delay

In-network data aggregation, as we have already seen, reduces the traffic inside the sensor network and therefore consumes less energy than forwarding the raw data. This has been confirmed by simulations and experiments carried out by several research groups ([3], [6], [7], [10]). However, the cost of in-network processing, especially in tree-based approaches, is a higher delay, which is proportional to the depth of the aggregation tree. This happens mainly because the nodes at higher levels (closer to the root) have to wait for the aggregates of the large sub-trees under them before they forward their own values up the tree. Moreover, the time needed to search the cache (matching) in order to suppress duplicates ([3], [4]) is not always negligible.

Data freshness - accuracy

The longer a sink node waits for the calculation of an aggregate, the more readings it is likely to receive (from most of the nodes of interest), and thus the more accurate the received value will be. On the other hand, waiting too long may result in stale data, especially in the case of periodic aggregation or when frequent environmental changes occur.
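The energy vs. accuracy knob discussed at the start of this section can be made concrete with a toy model. The sketch below is a deliberate simplification and not the algorithm of [19]: a single node reports a reading only when it differs from the last reported value by more than Th, so the number of transmissions stands in for energy and the worst gap between the true and the last reported value stands in for accuracy. The readings and the function name are invented for illustration.

```python
from typing import List, Tuple


def simulate(readings: List[float], th: float) -> Tuple[int, float]:
    """Return (transmissions, worst-case error) for a report-on-change policy."""
    last_sent = readings[0]  # the first reading is always transmitted
    transmissions = 1
    worst_error = 0.0
    for reading in readings[1:]:
        if abs(reading - last_sent) > th:
            # The change exceeds the threshold: spend energy on a transmission.
            last_sent = reading
            transmissions += 1
        # Error seen by the receiver, which only knows the last reported value.
        worst_error = max(worst_error, abs(reading - last_sent))
    return transmissions, worst_error


# Hypothetical sequence of sensed values.
READINGS = [20.0, 20.5, 21.0, 23.0, 23.5, 26.0, 25.0, 22.0]

results = {th: simulate(READINGS, th) for th in (0.0, 1.0, 3.0)}
```

For the example readings, Th = 0 yields 8 transmissions with no error, Th = 1 yields 4 transmissions with a worst-case error of 1.0, and Th = 3 yields 2 transmissions with a worst-case error of 3.0.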

4 Conclusions

Wireless sensor networks are a new area of research with many open issues and challenges, owing to the strict constraints on their resources. Although the multihop communication protocols of mobile ad-hoc networks have many solutions to offer, only a few of them are applicable to a WSN. In this paper we focused on two goals. The first was to introduce the reader to the networking issues of a WSN and point out the necessity of data naming and data-centric routing approaches. The second was to present several data-centric techniques that reduce the traffic inside the network and thus save energy. These techniques are referred to as data aggregation techniques, and a significant amount of past research has been devoted to them. Although many of these techniques look promising, there are still many challenges that need to be addressed.

REFERENCES

[1] W. Ye, J. Heidemann, and D. Estrin. “An Energy-Efficient MAC Protocol for Wireless Sensor Networks”. In Proceedings of IEEE Infocom ’02, New York, New York, June 23-27, 2002.
[2] I. Demirkol, C. Ersoy, and F. Alagöz. “MAC Protocols for Wireless Sensor Networks: a Survey”. IEEE Communications Magazine, 2005.
[3] C. Intanagonwiwat, R. Govindan, and D. Estrin. “Directed diffusion: A scalable and robust communication paradigm for sensor networks”. In Proceedings of the Sixth Annual International Conference on Mobile Computing and Networks (MobiCOM 2000), Boston, MA, August 2000.
[4] J. Heidemann, F. Silva, C. Intanagonwiwat, R. Govindan, D. Estrin, and D. Ganesan. “Building efficient wireless sensor networks with low-level naming”. In SOSP, October 2001.
[5] P. Bonnet, J. Gehrke, and P. Seshadri. “Towards sensor database systems”. In 2nd International Conference on Mobile Data Management, Hong Kong, January 2001.
[6] Y. Yao and J. Gehrke. “The cougar approach to in-network query processing in sensor networks”. In SIGMOD Record, September 2002.
[7] S. Madden, R. Szewczyk, M. J. Franklin, and D. Culler. “Supporting Aggregate Queries Over Ad-Hoc Wireless Sensor Networks”. In Proc. of WMCSA, June 2002.
[8] S. Nath, P. B. Gibbons, S. Seshan, and Z. Anderson. “Synopsis diffusion for robust aggregation in sensor networks”. In SenSys, 2004.
[9] D. Kempe, A. Dobra, and J. Gehrke. “Gossip-based Computation of Aggregate Information”. In Proc. 44th Annual IEEE Symp. on Foundations of Computer Science (FOCS), October 2003.
[10] S. Madden, M. Franklin, J. Hellerstein, and W. Hong. “TAG: A tiny aggregation service for ad-hoc sensor networks”. In Proc. 5th Symp. on Operating Systems Design and Implementation, December 2002.
[11] S. Madden, M. Franklin, J. Hellerstein, and W. Hong. “The design of an acquisitional query processor for sensor networks”. In Proceedings of ACM SIGMOD, San Diego, June 9–12, 2003. ACM Press, New York, 2003, pp. 491–502.
[12] E. Shih, S.-H. Cho, N. Ickes, R. Min, A. Sinha, A. Wang, and A. Chandrakasan. “Physical layer driven protocol and algorithm design for energy-efficient wireless sensor networks”. In The Seventh Annual International Conference on Mobile Computing and Networking 2001, July 2001, pp. 272–287.
[13] J. C. Haartsen and S. Mattisson. “Bluetooth - a new low-power radio interface providing short-range connectivity”. Proc. IEEE, vol. 88, no. 10, October 2000, pp. 1651–1661.
[14] J. P. Lynch and K. J. Loh. “A Summary Review of Wireless Sensors and Sensor Networks for Structural Health Monitoring”. The Shock and Vibration Digest, 2006.
[15] LAN MAN Standards Committee of the IEEE Computer Society. Wireless LAN medium access control (MAC) and physical layer (PHY) specification. IEEE, New York, NY, USA, IEEE Std 802.11-1997 edition, 1997.
[16] M. Stemm and R. H. Katz. “Measuring and reducing energy consumption of network interfaces in hand-held devices”. IEICE Transactions on Communications, vol. E80-B, no. 8, pp. 1125–1131, Aug. 1997.
[17] A. Muqattash and M. Krunz. “CDMA-Based MAC Protocol for Wireless Ad Hoc Networks”. MobiHoc ’03, June 1–3, 2003, Annapolis, Maryland, USA.
[18] J. N. Al-Karaki and A. E. Kamal. “Routing Techniques in Wireless Sensor Networks: A Survey”.
[19] A. Boulis, S. Ganeriwal, and M. B. Srivastava. “Aggregation in Sensor Networks: An Energy-Accuracy Trade-off”. Sensor Network Protocols and Applications (SNPA ’03), May 2003.
