
Looking at clustering similarity metrics

Clustering and classification methods are based on calculating the similarity or difference between two observations. If your dataset is numeric (composed of only numerical features) and can be portrayed on an n-dimensional plot, you can use various geometric metrics to measure the distances between observations in your multidimensional data.

An n-dimensional plot is a multidimensional scatter plot that you can use to plot n dimensions of data.

Some popular geometric metrics used for calculating distances between observations are simply different geometric functions that are useful for modeling distances between points:

Euclidean metric: A measure of the distance between points plotted on a Euclidean plane.

Manhattan metric: A measure of the distance between points, where distance is calculated as the sum of the absolute values of the differences between two points' Cartesian coordinates.

Minkowski distance metric: A generalization of the Euclidean and Manhattan distance metrics. Quite often, these metrics can be used interchangeably.

Cosine similarity metric: A measure of the similarity of two data points based on their orientation, as determined by taking the cosine of the angle between them.
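To make these metrics concrete, here is a minimal sketch that computes each one for a pair of sample points. It uses SciPy's scipy.spatial.distance module, which is my choice of tooling rather than something named in the text:

import numpy as np
from scipy.spatial import distance

# Two sample observations plotted in 3-dimensional space
a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 0.0, 3.0])

# Euclidean metric: straight-line distance between the points
print(distance.euclidean(a, b))       # about 3.606

# Manhattan metric: sum of the absolute coordinate differences
print(distance.cityblock(a, b))       # |1-4| + |2-0| + |3-3| = 5

# Minkowski metric: generalizes both (p=2 is Euclidean, p=1 is Manhattan)
print(distance.minkowski(a, b, p=2))  # matches the Euclidean result

# Cosine similarity: SciPy returns the cosine *distance*, so the
# similarity is 1 minus that value
print(1 - distance.cosine(a, b))

Note that with p=2 the Minkowski call reproduces the Euclidean result, which is why these metrics can often be used interchangeably.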

Lastly, for non-numeric data, you can use metrics like the Jaccard distance metric, an index that compares the number of features that two observations have in common. For example, to illustrate a Jaccard distance, look at these two text strings:

Saint Louis de Ha-ha, Quebec

St-Louis de Ha!Ha!, QC

What features do these text strings have in common? And what features are different between them? The Jaccard metric generates a numerical index value that quantifies the similarity between text strings.
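As a rough sketch of the idea, the function below treats each lowercase word in a string as one of its features; that tokenization choice is an assumption on my part, since the text doesn't specify how features are extracted:

def jaccard_distance(s1, s2):
    # Treat each lowercase word as one feature of the string
    features1 = set(s1.lower().split())
    features2 = set(s2.lower().split())
    # Jaccard similarity = shared features / all features,
    # and Jaccard distance = 1 - similarity
    return 1 - len(features1 & features2) / len(features1 | features2)

print(jaccard_distance("Saint Louis de Ha-ha, Quebec",
                       "St-Louis de Ha!Ha!, QC"))  # 0.875

Under this word-level tokenization the two strings share only the word "de", so the distance comes out high (0.875); a character-level tokenization would judge the same pair far more similar.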

Identifying Clusters in Your Data

You can use many different algorithms for clustering, but the speed and robustness of the k-means algorithm make it a popular choice among experienced data scientists. As alternatives, kernel density estimation methods, hierarchical algorithms, and neighborhood algorithms are also available to help you identify clusters in your dataset.

Clustering with the k-means algorithm

The k-means clustering algorithm is a simple, fast, unsupervised learning algorithm that you can use to predict groupings within a dataset. The model makes its prediction based on the number of centroids you specify in advance, represented by k.
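To show how this works in practice, here is a minimal sketch using scikit-learn's KMeans class; the library choice and the toy dataset are my assumptions, not something prescribed by the text:

import numpy as np
from sklearn.cluster import KMeans

# Toy 2-D dataset with two visually obvious groupings
X = np.array([[1.0, 1.1], [1.2, 0.9], [0.8, 1.0],
              [8.0, 8.2], [8.1, 7.9], [7.9, 8.0]])

# k is the number of centroids the model should look for; here we pick k=2
model = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = model.fit_predict(X)    # cluster assignment for each observation

print(labels)                    # e.g., [0 0 0 1 1 1] (label order may vary)
print(model.cluster_centers_)    # the two learned centroid coordinates

Because k-means assigns each observation to its nearest centroid, the first three points and the last three points end up in separate clusters.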
