Big Data Analytics
Massive Data Analysis: Tasks, Tools, . . . 33
machine crashes during a MapReduce job in March 2006 [12] and at a minimum one disk failure in each execution of a MapReduce job with 4000 tasks. Because failures are this common, any resource allocation method for Big Data jobs should be fault-tolerant.
MapReduce was originally designed to be robust against the faults that commonly occur at large-scale resource providers with many computers and devices such as network switches and routers. For instance, reports show that during the first year of a cluster's operation at Google there were 1000 individual machine failures and thousands of hard-drive failures.
MapReduce uses logs to tolerate faults: the outputs of the Map and Reduce phases are logged to disk [39]. If a Map task fails, it is re-executed with the same partition of data. If a Reducer fails, the key/value pairs for that Reducer are regenerated.
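The log-and-re-execute recovery described above can be sketched as follows. This is a minimal illustration, not Hadoop's implementation; the function names and retry limit are assumptions.

```python
# Sketch of log-based fault tolerance in a MapReduce-style runtime.
# All names here (run_with_reexecution, MAX_RETRIES) are illustrative,
# not Hadoop APIs.

MAX_RETRIES = 3  # assumed retry budget


def log_to_disk(task, partition, output):
    """Stand-in for persisting intermediate output to local disk [39]."""
    pass


def run_with_reexecution(task, partition):
    """Run a Map or Reduce function; on failure, re-execute it
    with the same partition of data."""
    last_error = None
    for attempt in range(MAX_RETRIES):
        try:
            output = task(partition)              # Map or Reduce work
            log_to_disk(task, partition, output)  # log output for recovery
            return output
        except Exception as err:
            last_error = err                      # crash: retry same partition
    raise RuntimeError("task failed after retries") from last_error
```

A scheduler would call `run_with_reexecution` once per partition; because the input partition is immutable, re-running the task is safe.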
6.3 Replication in Big Data
Big Data applications either do not replicate their data or replicate it automatically through a distributed file system (DFS). Without replication, the failure of a server storing the data forces re-execution of the affected tasks. Although replication provides more fault tolerance, it adds network overhead and increases the execution time of the job.
The Hadoop platform provides users with a static replication option that determines how many times a data block is replicated within the cluster. Such a static approach adds significant storage overhead and slows down job execution. One solution is dynamic replication, which adjusts the replication rate based on the usage rate of the data; dynamic replication approaches help utilize storage and processing resources efficiently [64].
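The dynamic-replication idea can be sketched as a policy that maps a block's access rate to a replication factor. The thresholds and names below are illustrative assumptions, not taken from [64]:

```python
# Sketch of a dynamic replication policy: hot blocks get more replicas,
# cold blocks fewer. Thresholds are illustrative assumptions.

MIN_REPLICAS = 1   # cold data: keep a single copy to save storage
MAX_REPLICAS = 5   # cap replicas to bound storage overhead


def target_replicas(accesses_per_hour):
    """Map a block's usage rate to a replication factor:
    one extra replica per 100 accesses/hour, clamped to [MIN, MAX]."""
    if accesses_per_hour <= 0:
        return MIN_REPLICAS
    wanted = 1 + accesses_per_hour // 100
    return max(MIN_REPLICAS, min(MAX_REPLICAS, wanted))
```

A background process would periodically re-evaluate each block and add or drop replicas to match the target, trading storage for read bandwidth on popular data.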
Cost-effective incremental replication [34] is a method for cloud-based jobs that predicts when a job needs replication. Several other data replication schemes exist for Big Data applications on clouds. In this section, we discuss four major schemes: Synchronous replication, Asynchronous replication, Rack-level replication, and Selective replication. These schemes can be applied at different stages of the data cycle.
A Synchronous data replication scheme (e.g., HDFS) ensures data consistency by blocking a job's producer tasks until replication finishes. Although Synchronous replication yields high consistency, it introduces latency that degrades the performance of producer tasks. In an Asynchronous data replication scheme [32], a producer proceeds regardless of whether replication of the block has completed. This nonblocking behavior improves producer performance, but the consistency of Asynchronous replication is weaker than that of Synchronous replication. When Asynchronous replication is used in the Hadoop framework, Map and Reduce tasks can proceed concurrently.
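The blocking/nonblocking contrast between the two schemes can be sketched as follows; `replicate` and the in-memory replica lists are stand-ins, not HDFS APIs:

```python
# Sketch contrasting Synchronous and Asynchronous block replication.
# replicate() and the replica lists are illustrative stand-ins, not HDFS APIs.
import threading


def replicate(block, replica):
    """Stand-in for copying a block to one replica over the network."""
    replica.append(block)


def write_sync(block, replicas):
    """Synchronous scheme: the producer blocks until every replica
    holds the block, so consistency is high but latency is added."""
    for r in replicas:
        replicate(block, r)
    return "ack"  # producer resumes only after all copies exist


def write_async(block, replicas):
    """Asynchronous scheme: replication runs in background threads and
    the producer proceeds immediately; replicas may briefly lag."""
    threads = [threading.Thread(target=replicate, args=(block, r))
               for r in replicas]
    for t in threads:
        t.start()
    return threads  # caller may join() later if it needs the copies
```

With `write_async`, the producer (and, in Hadoop's case, subsequent Map and Reduce work) continues while the copies are still being written, which is exactly the window in which a reader may observe a stale replica.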