14.08.2017 Views

Creative Data Mining : Documentation of the teaching results from the Spring Semester 2017

Creative Data Mining : Documentation of the teaching results from the Spring Semester 2017

Creative Data Mining : Documentation of the teaching results from the Spring Semester 2017

SHOW MORE
SHOW LESS

Create successful ePaper yourself

Turn your PDF publications into a flip-book with our unique Google optimized e-Paper software.

<strong>Documentation</strong> <strong>of</strong> <strong>the</strong> <strong>teaching</strong> <strong>results</strong> <strong>from</strong> <strong>the</strong> spring semester <strong>2017</strong><br />

<strong>Creative</strong> <strong>Data</strong> <strong>Mining</strong><br />

Danielle Griego, Dani Zünd and Pr<strong>of</strong>. Dr. Gerhard Schmitt


Chair <strong>of</strong> Information Architecture<br />

<strong>Creative</strong> <strong>Data</strong> <strong>Mining</strong><br />

<strong>Documentation</strong> <strong>of</strong> <strong>teaching</strong> <strong>results</strong><br />

Danielle Griego, Dani Zünd, and Gerhard Schmitt<br />

Chair <strong>of</strong> Information Architecture<br />

Teaching<br />

Danielle Griego, Dani Zünd and Gerhard Schmitt<br />

Syllabi<br />

http://www.ia.arch.ethz.ch/category/<strong>teaching</strong>/fs<strong>2017</strong>-creative-data-mining/<br />

Seminar<br />

<strong>Creative</strong> <strong>Data</strong> <strong>Mining</strong><br />

Students<br />

Biyu Wang, Jiani Liu, Jack Kenning, Hong Liang, Gong Chen, Zhonghau Dai<br />

Published by<br />

Swiss Federal Institute <strong>of</strong> Technology in Zurich (ETHZ)<br />

Department <strong>of</strong> Architecture<br />

Institute <strong>of</strong> Technology in Architecture<br />

Chair <strong>of</strong> Information Architecture<br />

Wolfgang-Pauli-Strasse 27, HIT H 31.6<br />

8093 Zurich<br />

Switzerland<br />

Zurich, August <strong>2017</strong><br />

Layout<br />

Brigitte M. Clements<br />

Contact<br />

grigod@arch.ethz.ch | http://www.ia.arch.ethz.ch/griegod/<br />

Cover picture:<br />

Front side: Labelled Frames. Jack Kenning


Course Description and Program<br />

<strong>Creative</strong> <strong>Data</strong> <strong>Mining</strong>: Uncover and Evaluate<br />

Mondays 10:00 - 12:00<br />

051-0726-17U | 2 ECTS*<br />

<strong>Creative</strong> <strong>Data</strong> <strong>Mining</strong><br />

Uncover and Evaluate<br />

The participants <strong>of</strong> this course learn how to collect, process,<br />

analyze and interpret real spatial and temporal data in order to<br />

by using actual data <strong>from</strong> a recent study and analysing it with<br />

about <strong>the</strong> urban environment and test <strong>the</strong> stated hypo<strong>the</strong>sis<br />

students with a skill-set to fur<strong>the</strong>r support <strong>the</strong>ir design and<br />

decision making processes.<br />

The course focuses on creating deeper insights to critically<br />

evaluate design alternatives for urban planning projects. Students<br />

will work with time-series and geo-referenced data including<br />

temperature, relative humidity, illuminance, noise, people density,<br />

and dust particulate matter. Subjective impression survey data<br />

will also be integrated into <strong>the</strong> student projects to fur<strong>the</strong>r explore<br />

20.02.<strong>2017</strong><br />

27.02.<strong>2017</strong><br />

06.03.<strong>2017</strong><br />

13.03.<strong>2017</strong><br />

20.03.<strong>2017</strong><br />

27.03.<strong>2017</strong><br />

03.04.<strong>2017</strong><br />

10.04.<strong>2017</strong><br />

17.04.<strong>2017</strong><br />

24.04.<strong>2017</strong><br />

01.05.<strong>2017</strong><br />

08.05.<strong>2017</strong><br />

15.05.<strong>2017</strong><br />

Course Introduction<br />

Introduction to Python & Programming I<br />

Introduction to Python & Programming II<br />

<strong>Data</strong> Processing<br />

Seminar week (No lecture)<br />

Intro to time-series data analysis<br />

Time series data analysis ctd. & Machine learning<br />

Machine learning ctd.<br />

Holiday (No lecture)<br />

Programming tutorial applications<br />

Holiday (No lecture)<br />

Q&A Feedback Workshop I<br />

Final iA critique<br />

Combined critique with <strong>the</strong> o<strong>the</strong>r iA courses<br />

(13:00 - 18:00)<br />

experiences. Non-architectural skills <strong>the</strong> participants can develop<br />

during this course are 1) an introduction to programming 2) how<br />

clustering methods like PCA or K-Means could be applied in an<br />

architectural context.<br />

Requirement<br />

Former knowledge <strong>of</strong> any digital tool or coding language is<br />

most welcome but NOT required. You only need to provide a<br />

reasonable amount <strong>of</strong> motivation and <strong>of</strong> course a notebook.<br />

Where<br />

HIT H 34.1 (Video Wall)<br />

Supervision<br />

Danielle Griego<br />

Daniel Zünd<br />

Artem Chirkin<br />

griego@arch.ethz.ch<br />

zuend@arch.ethz.ch<br />

chirkin@arch.ethz.ch<br />

* Total 60 h = 2 ECTS<br />

Ungraded <strong>Semester</strong> Performance<br />

The most recent outline will be found on www.ia.arch.ethz.ch<br />

Pr<strong>of</strong>. Dr. Gerhard Schmitt<br />

Chair <strong>of</strong> Information Architecture<br />

Information Science Lab<br />

Wolfgang-Pauli-Strasse 27, 8093 Zurich<br />

www.ia.arch.ethz.ch


Content<br />

Does Distance Affect our Feelings?<br />

Student: Biyu Wang & Jiani Liu<br />

Urban Perception: A deep learning approach<br />

Student: Jack Kenning<br />

Street Cross Section<br />

Student: Hong Liang<br />

Security Analysis<br />

Student: Gong Chen & Zhonghau Dai<br />

p.9<br />

p.32<br />

p.49<br />

p.103


Objectivity, Subjectivity, Colour<br />

Student: Andrea Panzeri


Motivation<br />

As two geography students, Tobler’s first law <strong>of</strong> geography is one <strong>of</strong> <strong>the</strong> most important rules we<br />

learned about spatial relationship:<br />

Everything is related to everything else, but near things are more related than distant things.<br />

This applies to many issues in real world. For example, if you map population density <strong>of</strong> a country, it<br />

is highly possible that a dense city is surrounded also by dense cities. We want to examine if this also<br />

applies to <strong>the</strong> cognition world.<br />

In <strong>the</strong> ESUM (Analyzing trade-<strong>of</strong>fs between Energy and Social performance <strong>of</strong> Urban Morphologies)<br />

project, 32 participants were asked to go through a path with 14 checkpoints and give feedback<br />

about <strong>the</strong>ir perception. The checkpoints were specifically selected so that <strong>the</strong>y differ <strong>from</strong> each o<strong>the</strong>r.<br />

However, we wonder if <strong>the</strong> participants could feel <strong>the</strong> difference as planned instead <strong>of</strong> relating <strong>the</strong>ir<br />

current perception with <strong>the</strong>ir impression <strong>from</strong> last several checkpoints.<br />

Hypo<strong>the</strong>sis & Research Question<br />

In our project, we investigated if distance affects people’s feeling. Based on Tobler’s first law <strong>of</strong> geography,<br />

we assume that people’s feelings towards a specific place are related to feelings <strong>of</strong> places nearby. Thus,<br />

people’s attitudes towards near space are similar.<br />

The concept <strong>of</strong> distance we used here is not Euclidian distance, but <strong>the</strong> length <strong>of</strong> walking distance in<br />

<strong>the</strong> experiment.<br />

Approach & Methods<br />

Similarity is something hard to measure precisely. We used clustering as a reference: objects in one<br />

cluster are more similar than those in two different clusters.<br />

We used data <strong>from</strong> <strong>the</strong> ESUM project and divided <strong>the</strong>m into 3 groups: survey <strong>results</strong> which reflect<br />

people’s subjective perception, bi<strong>of</strong>eedback data which measures people’s physical reaction, and<br />

environment data which reflects <strong>the</strong> objective difference. Without <strong>of</strong>fering geometry data, we checked<br />

if <strong>the</strong> cluster distribution conforms to certain spatial patterns.<br />

10<br />

New Methods in <strong>Creative</strong> <strong>Data</strong> <strong>Mining</strong> | Final project documentation


We We did did clustering for for each each group group participant by by participant ra<strong>the</strong>r ra<strong>the</strong>r than than aggregating 32 32 people’s data<br />

toge<strong>the</strong>r, data toge<strong>the</strong>r, mainly because mainly cognition because cognition among people among as people well as as dynamic well as features dynamic <strong>of</strong> environment features <strong>of</strong> are<br />

different. environment For example, are different. some people For example, might find some it noisy, people but might o<strong>the</strong>rs find might it noisy, not evaluate but o<strong>the</strong>rs sounds might with not same<br />

decibel evaluate value. sounds Some with participants same decibel walked value. in <strong>the</strong> Some morning, participants while o<strong>the</strong>rs walked might in do <strong>the</strong> <strong>the</strong> morning, same task while in <strong>the</strong><br />

afternoon with significant higher temperature. We also standardized data, since different variable varies<br />

o<strong>the</strong>rs might do <strong>the</strong> same task in <strong>the</strong> afternoon with significant higher temperature. We also<br />

in different scale.<br />

standardized data, since different variable varies in different scale.<br />

For<br />

For<br />

survey<br />

survey<br />

<strong>results</strong>,<br />

<strong>results</strong>,<br />

which<br />

which<br />

are small<br />

are small<br />

in sample<br />

in sample<br />

size, K-means<br />

size, K-means<br />

clustering<br />

clustering<br />

was used<br />

was<br />

and<br />

used<br />

14 checkpoints<br />

and 14<br />

were clustered into three classes for each participant. To compare <strong>the</strong> result, we used a metric to<br />

checkpoints were clustered into three classes for each participant. To compare <strong>the</strong> result, we<br />

compare if two checkpoints are more similar than ano<strong>the</strong>r two: count <strong>of</strong> participants to recognize each<br />

two<br />

used<br />

checkpoints<br />

a metric<br />

into<br />

to compare<br />

one cluster.<br />

if<br />

For<br />

two<br />

all<br />

checkpoints<br />

91 (14*13/2)<br />

are<br />

checkpoint<br />

more similar<br />

pairs,<br />

than<br />

<strong>the</strong> higher<br />

ano<strong>the</strong>r<br />

<strong>the</strong><br />

two:<br />

count<br />

count<br />

is (maximal<br />

<strong>of</strong><br />

32, participants minimal 0), to <strong>the</strong> recognize more similar each <strong>the</strong>y two are. checkpoints We checked into if one checkpoints cluster. For closer all 91 to (14*13/2) each o<strong>the</strong>r checkpoint show higher<br />

similarity. pairs, <strong>the</strong> higher <strong>the</strong> count is (maximal 32, minimal 0), <strong>the</strong> more similar <strong>the</strong>y are. We checked if<br />

checkpoints closer to each o<strong>the</strong>r show higher similarity.<br />

For bi<strong>of</strong>eedback data and environment data, which are almost continuous along <strong>the</strong> path with 666<br />

points, For bi<strong>of</strong>eedback we used DBSCAN data and with environment same parameters data, which so <strong>the</strong> are count almost <strong>of</strong> clusters continuous would along not be <strong>the</strong> predetermined<br />

path with<br />

and 666 <strong>the</strong> points, clusters we are used comparable DBSCAN between with same different parameters dataset. so We <strong>the</strong> mapped count <strong>the</strong> <strong>of</strong> <strong>results</strong> clusters and would compared not be how<br />

<strong>the</strong>se predetermined two clustering and pattern <strong>the</strong> clusters differ <strong>from</strong> are each comparable o<strong>the</strong>r for between <strong>the</strong> same different participant. dataset. We mapped <strong>the</strong><br />

<strong>results</strong> and compared how <strong>the</strong>se two clustering pattern differ <strong>from</strong> each o<strong>the</strong>r for <strong>the</strong> same<br />

participant.<br />

Results & Discussion<br />

Subjective feeling<br />

Results & Discussion<br />

In <strong>the</strong><br />

Subjective<br />

first impression,<br />

feeling<br />

it looks like <strong>the</strong> first several checkpoints are more into one cluster, and <strong>the</strong> last<br />

several<br />

In <strong>the</strong><br />

in<br />

first<br />

ano<strong>the</strong>r,<br />

impression,<br />

while checkpoints<br />

it looks like<br />

in<br />

<strong>the</strong><br />

<strong>the</strong><br />

first<br />

middle<br />

several<br />

show<br />

checkpoints<br />

more uncertainty.<br />

are more<br />

See<br />

into<br />

example<br />

one cluster,<br />

<strong>of</strong> participant<br />

and<br />

No.12 in figure 1. Though, <strong>the</strong>re are also a few exceptions.<br />

<strong>the</strong> last several in ano<strong>the</strong>r, while checkpoints in <strong>the</strong> middle show more uncertainty. See<br />

example <strong>of</strong> participant No.12 in figure 1. Though, <strong>the</strong>re are also a few exceptions.<br />

Figure 1 Cluster result <strong>of</strong> participant No.12<br />

Then we calculated <strong>the</strong> similarity parameters explained before and <strong>the</strong> <strong>results</strong> are shown in<br />

figure 2. 14 checkpoints were place on a circle and <strong>the</strong> width <strong>of</strong> <strong>the</strong> chords connecting two<br />

Then checkpoints we calculated represents <strong>the</strong> similarity <strong>the</strong> similarity parameters metric explained <strong>of</strong> <strong>the</strong>se before two. Note and that <strong>the</strong> <strong>results</strong> checkpoint are shown 1 and 14 in figure are <strong>the</strong> 2. 14<br />

checkpoints were place on a circle and <strong>the</strong> width <strong>of</strong> <strong>the</strong> chords connecting two checkpoints represents<br />

<strong>the</strong> similarity metric <strong>of</strong> <strong>the</strong>se two. Note that checkpoint 1 and 14 are <strong>the</strong> starting and ending points<br />

<strong>Creative</strong> <strong>Data</strong> <strong>Mining</strong> | <strong>Spring</strong> <strong>2017</strong> | Final Projects<br />

which is <strong>the</strong> fur<strong>the</strong>st pair although <strong>the</strong>y were placed close to each o<strong>the</strong>r. Similarly, 1 and 13, 2 and 14<br />

are also very far away. Then it shows a tendency that checkpoints closer to each o<strong>the</strong>r give out a higher<br />

New Methods in <strong>Creative</strong> <strong>Data</strong> <strong>Mining</strong> | Final project documentation<br />

11


similarity in general. Take checkpoint 11 for example, it’s clear that chords by two sides are much wider<br />

than starting chords and in ending <strong>the</strong> middle. points which is <strong>the</strong> fur<strong>the</strong>st pair although <strong>the</strong>y were placed close to each<br />

o<strong>the</strong>r. Similarly, 1 and 13, 2 and 14 are also very far away. Then it shows a tendency that<br />

checkpoints closer to each o<strong>the</strong>r give out a higher similarity in general. Take checkpoint 11 for<br />

starting and ending points which is <strong>the</strong> fur<strong>the</strong>st pair although <strong>the</strong>y were placed close to each<br />

example, it’s clear that chords by two sides are much wider than chords in <strong>the</strong> middle.<br />

starting o<strong>the</strong>r. Similarly, and ending 1 and points 13, which 2 and is 14 <strong>the</strong> are fur<strong>the</strong>st also very pair far although away. Then <strong>the</strong>y were it shows placed a tendency close to each that<br />

o<strong>the</strong>r. checkpoints Similarly, closer 1 to and each 13, o<strong>the</strong>r 2 and give 14 out are a also higher very similarity far away. in Then general. it shows Take checkpoint a tendency 11 that for<br />

checkpoints example, it’s closer clear that to each chords o<strong>the</strong>r by two give sides out a are higher much similarity wider than in general. chords in Take <strong>the</strong> middle. checkpoint 11 for<br />

example, it’s clear that chords by two sides are much wider than chords in <strong>the</strong> middle.<br />

Figure 2: Similarity <strong>of</strong> subjective feelings<br />

Following <strong>the</strong> Moran’s I index measuring two-dimensional autocorrelation, we also calculated<br />

Following <strong>the</strong> overall <strong>the</strong> autocorrelation Moran’s I index for measuring <strong>the</strong> Figure similarity 2: Similarity two-dimensional metric <strong>of</strong> subjective using following autocorrelation, feelings formula: we also calculated <strong>the</strong> overall<br />

autocorrelation for <strong>the</strong> similarity Figure metric 2: using Similarity following <strong>of</strong> subjective feelings<br />

Following <strong>the</strong> Moran’s I index measuring 14 two-dimensional D formula:<br />

∑<br />

ij autocorrelation, we also calculated<br />

j=1 C ij ∗ exp⁡(− ⁡)<br />

Following <strong>the</strong> overall <strong>the</strong> autocorrelation Moran’s I index for <strong>the</strong> measuring similarity two-dimensional metric using ∑ j D ij following autocorrelation, formula: we also calculated<br />

I<br />

<strong>the</strong> overall autocorrelation for <strong>the</strong> i =<br />

similarity ∑14<br />

⁡⁡, j ≠ i<br />

j=1<br />

metric C ij /13using following formula:<br />

14<br />

D<br />

∑<br />

ij<br />

j=1 C ij ∗ exp⁡(−<br />

14<br />

D<br />

⁡)<br />

∑ D<br />

Cij is <strong>the</strong> similar metric and Dij is <strong>the</strong> ∑walking ij<br />

j=1 C ij ∗<br />

distance.<br />

exp⁡(− j It actually ij<br />

⁡)<br />

uses distance-weighted average<br />

I i =<br />

∑14<br />

∑ j D<br />

⁡⁡, j ≠ i<br />

<strong>of</strong> similarity metric divided by ij<br />

I<br />

arithmetic<br />

i =<br />

average. j=1 C ij /13 When similar<br />

∑14<br />

⁡⁡, j ≠<br />

metric<br />

i<br />

is evenly distributed, <strong>the</strong><br />

Ii would equal 1. Ii larger than 1 illustrates j=1that C ij /13 in general closer points have higher similar<br />

Cij is <strong>the</strong> similar metric and Dij is <strong>the</strong> walking distance. It actually uses distance-weighted average<br />

metric. The result shows that for most checkpoints it confirms our assumption but only check<br />

Cij Cij <strong>of</strong> is is similarity <strong>the</strong> similar metric metric divided and by Dij is arithmetic is <strong>the</strong> <strong>the</strong> walking average. distance. When It actually similar It actually uses metric uses evenly distance-weighted distributed, average <strong>the</strong> average <strong>of</strong><br />

point 7 and 8.<br />

similarity <strong>of</strong> Ii would similarity metric equal metric divided 1. Ii divided larger by arithmetic than by arithmetic 1 illustrates average. that When When in general similar closer metric points is is evenly have distributed, higher similar<br />

<strong>the</strong> Ii would<br />

equal<br />

Ii metric. Table would 1. 1 Overall Ii The equal larger result autocorrelation 1. than<br />

Ii shows larger 1 illustrates that Index than for 1 each most that illustrates in checkpoints general that closer it general confirms points closer our have assumption points higher have similar but higher only metric. similar check The result<br />

shows metric. point<br />

that<br />

7 and The for<br />

8. result most shows checkpoints that for it most confirms checkpoints our assumption it confirms but our only assumption check point but 7 only and check 8.<br />

Check Point 1 2 3 4 5 6 7<br />

point 7 and 8.<br />

Table Overall 1 Overall Correlation autocorrelation 1.87 Index for 1.99 each check point 1.33 1.35 1.33 1.12 0.49<br />

Table Check 1 Overall Point autocorrelation 8 Index for each 9 check point 10 11 12 13 14<br />

Check Overall Point Correlation 0.81 1 1.07 2 1.15 3 1.40 4 1.47 5 1.47 6 1.15 7<br />

Check Overall Point Correlation 1.87 1 1.99 2 1.333 1.35 4 1.33 5 1.12 6 0.49 7<br />

Overall Check Point Correlation 1.878 1.999 1.33 10 1.35 11 1.33 12 1.12 13 0.49 14<br />

Check Overall One more Point Correlation thing that worth 0.818 our attention: 1.07 9 checkpoint 1.15 10 9 1.40 and 11 14 are 1.47 actually 12 <strong>the</strong> 1.47 13 same places 1.15 14 (see<br />

Overall figure 1 Correlation on <strong>the</strong> map). However, 0.81 <strong>the</strong> 1.07 similar metric 1.15 is only 1.40 24, meaning 1.47 that only 1.47 2/3 respondents 1.15<br />

feel it similarly when <strong>the</strong>y come back to <strong>the</strong> exact same place several minutes later. So <strong>the</strong>re<br />

One more thing that worth our attention: checkpoint 9 and 14 are actually <strong>the</strong> same places (see<br />

must be something affecting people’s feelings except <strong>the</strong> environment <strong>of</strong> <strong>the</strong> place itself.<br />

One figure more 1 on thing <strong>the</strong> map). that worth However, our attention: <strong>the</strong> similar checkpoint metric is only 9 and 24, 14 meaning are actually that only <strong>the</strong> same 2/3 respondents places (see<br />

figure feel it 1 similarly on <strong>the</strong> map). when However, <strong>the</strong>y come <strong>the</strong> back similar to <strong>the</strong> metric exact is only same 24, place meaning several that minutes only 2/3 later. respondents So <strong>the</strong>re<br />

One<br />

feel<br />

more<br />

must <strong>Creative</strong> it be similarly<br />

thing that<br />

<strong>Data</strong> something <strong>Mining</strong> when<br />

worth<br />

affecting | <strong>Spring</strong> <strong>the</strong>y<br />

our<br />

come<br />

attention:<br />

<strong>2017</strong> people’s | back Final feelings Projects to<br />

checkpoint<br />

<strong>the</strong> exact except same<br />

9 and<br />

<strong>the</strong> environment place<br />

14 are<br />

several<br />

actually<br />

<strong>of</strong> minutes<br />

<strong>the</strong> same<br />

<strong>the</strong> place later.<br />

places<br />

itself. So <strong>the</strong>re<br />

(see figure<br />

1 on <strong>the</strong> map). However, <strong>the</strong> similar metric is only 24, meaning that only 2/3 respondents feel it similarly<br />

must be something affecting people’s feelings except <strong>the</strong> environment <strong>of</strong> <strong>the</strong> place itself.<br />

when <strong>the</strong>y come back to <strong>the</strong> exact same place several minutes later. So <strong>the</strong>re must be something<br />

affecting <strong>Creative</strong> people’s <strong>Data</strong> <strong>Mining</strong> feelings | <strong>Spring</strong> except <strong>2017</strong> | <strong>the</strong> Final environment Projects <strong>of</strong> <strong>the</strong> place itself.<br />

<strong>Creative</strong> <strong>Data</strong> <strong>Mining</strong> | <strong>Spring</strong> <strong>2017</strong> | Final Projects<br />

12<br />

New Methods in <strong>Creative</strong> <strong>Data</strong> <strong>Mining</strong> | Final project documentation


<strong>Creative</strong> <strong>Data</strong> <strong>Mining</strong> | <strong>Spring</strong> <strong>2017</strong> | Final Projects<br />

Physical reaction and environment difference<br />

Generally Speaking, bi<strong>of</strong>eedback data follows specific spatial pattern, as <strong>the</strong> closer points tend to<br />

be in <strong>the</strong> same cluster as well as <strong>the</strong>re are less clusters for each participant, even though we do not<br />

take spatial information into consideration during clustering; while participants’ environment data is<br />

opposite, which seems to be more randomly distributed. It makes sense since environment data such<br />

as <strong>the</strong> level Physical <strong>of</strong> sound, reaction dust, and etc. environment can be suddenly difference high when a truck pass by and can go back to normal<br />

level immediately after it runs away. Therefor <strong>the</strong> result can be interpreted as people’s feelings are<br />

relatively Generally stable Speaking, and change bi<strong>of</strong>eedback gradually data in spite follows <strong>of</strong> <strong>the</strong> specific sudden spatial change pattern, in environment.<br />

as <strong>the</strong> closer points tend<br />

to be in <strong>the</strong> same cluster as well as <strong>the</strong>re are less clusters for each participant, even though we<br />

Fur<strong>the</strong>rmore, do not take <strong>the</strong> spatial comparison information between into consideration participants’ during bi<strong>of</strong>eedback clustering; data while and participants’ environment data can<br />

be summarized environment as data three is opposite, groups, namely, which seems bi<strong>of</strong>eedback to be more data randomly fluctuates distributed. more than It makes environment sense data,<br />

bi<strong>of</strong>eedback since environment data and environment data such as data <strong>the</strong> level conform <strong>of</strong> sound, to similar dust, etc. pattern, can be and suddenly bi<strong>of</strong>eedback high when data a truck fluctuates<br />

less than pass environment by and can go data. back As to we normal expected, level immediately <strong>the</strong> first two after pattern it runs exist away. but Therefor only minority. <strong>the</strong> result Interesting can<br />

is that, be even interpreted <strong>the</strong>se as cases, people’s actually feelings bi<strong>of</strong>eedback are relatively data stable still and conforms change gradually to specific in spite spatial <strong>of</strong> <strong>the</strong> pattern. The<br />

possible<br />

sudden<br />

reason<br />

change<br />

is probably<br />

in environment.<br />

<strong>the</strong> too stable environment ra<strong>the</strong>r than sudden changes in bi<strong>of</strong>eedback.<br />

Fur<strong>the</strong>rmore, <strong>the</strong> comparison between participants’ bi<strong>of</strong>eedback data and environment data<br />

can be summarized as three groups, namely, bi<strong>of</strong>eedback data fluctuates more than<br />

environment data, bi<strong>of</strong>eedback data and environment data conform to similar pattern, and<br />

bi<strong>of</strong>eedback data fluctuates less than environment data. As we expected, <strong>the</strong> first two pattern<br />

exist but only for minority. Interesting is that, even in <strong>the</strong>se cases, actually bi<strong>of</strong>eedback data still<br />

conforms to specific spatial pattern. The possible reason is probably <strong>the</strong> too stable environment<br />

ra<strong>the</strong>r than sudden changes in bi<strong>of</strong>eedback.<br />

New Methods in <strong>Creative</strong> <strong>Data</strong> <strong>Mining</strong> | Final project documentation<br />

13


The pattern “environment data fluctuates more” is dominant, as 22 out <strong>of</strong> 29 participants’ show<br />

this pattern, which proves our hypo<strong>the</strong>sis more or less that bi<strong>of</strong>eedback data possess spatial<br />

The<br />

similarity.<br />

pattern<br />

As<br />

“environment<br />

some examples<br />

data<br />

shown<br />

fluctuates<br />

below,<br />

more”<br />

closer<br />

is dominant,<br />

space is highly<br />

as 22<br />

likely<br />

out<br />

to<br />

<strong>of</strong><br />

be<br />

29<br />

classified<br />

participants’<br />

in <strong>the</strong><br />

show<br />

this pattern, which proves our hypo<strong>the</strong>sis more or less that bi<strong>of</strong>eedback data possess spatial<br />

same cluster even though <strong>the</strong> corresponding environment does not show spatial correlation.<br />

similarity. As some examples shown below, closer space is highly likely to be classified in <strong>the</strong><br />

The Therefore, pattern “environment we can confirm data that fluctuates people react more” to environment is dominant, as changes 22 out in <strong>of</strong> a 29 more participants’ smooth way.<br />

same cluster even though <strong>the</strong> corresponding environment does not show spatial correlation. show<br />

this Besides, pattern, <strong>the</strong>re which are proves always our more hypo<strong>the</strong>sis outliers, more which or is less represented that bi<strong>of</strong>eedback as black point data possess along path spatial in <strong>the</strong><br />

similarity. Therefore, graph, in As environment we some can examples confirm data that which shown people also below, proves react closer its to space environment instability. is highly changes likely to be in a classified more smooth in <strong>the</strong> way.<br />

same Besides, cluster <strong>the</strong>re even are though always <strong>the</strong> more corresponding outliers, which environment is represented does not as show black spatial point correlation. along path in <strong>the</strong><br />

Therefore, graph, in environment we can confirm data that which people also react proves to environment its instability. changes in a more smooth way.<br />

Besides, <strong>the</strong>re are always more outliers, which is represented as black point along path in <strong>the</strong><br />

graph, in environment data which also proves its instability.<br />

Conclusions<br />

Through our three groups <strong>of</strong> clustering, we concluded that in general, people’s feelings towards<br />

a specific place are related to feelings <strong>of</strong> places nearby.<br />

14<br />

Conclusions <strong>Creative</strong> <strong>Data</strong> <strong>Mining</strong> | <strong>Spring</strong> <strong>2017</strong> | Final Projects<br />

Through our New three Methods groups in <strong>Creative</strong> <strong>of</strong> clustering, <strong>Data</strong> <strong>Mining</strong> we concluded | Final that project in general, documentation people’s feelings towards<br />

a specific place are related to feelings <strong>of</strong> places nearby.


Conclusions<br />

Through our three groups <strong>of</strong> clustering, we concluded that in general, people’s feelings towards a<br />

specific place are related to feelings <strong>of</strong> places nearby. People tend to feel that close places are similar<br />

both physically, which is proved by bi<strong>of</strong>eedback data, and mentally, which is proved by subjective<br />

survey data. Not only your brain feels that way, so do o<strong>the</strong>r parts <strong>of</strong> your body.<br />

However, <strong>the</strong> conclusion does not apply to all participants in <strong>the</strong> experiment. The validity is also limited<br />

by <strong>the</strong> sample size. We only confirmed that based on <strong>the</strong> ESUM project data, most people’s feelings<br />

are also affected by distance besides <strong>the</strong> actual environment. Fur<strong>the</strong>r study on larger sample is needed<br />

to validate our conclusion.<br />

References<br />

Tobler, W. R. (1970). A computer movie simulating urban growth in <strong>the</strong> Detroit region. Economic<br />

geography, 46(sup1), 234-240.<br />

Scikit-learn(2011): Machine Learning in Python, Pedregosa et al., JMLR 12, pp. 2825-2830<br />

New Methods in <strong>Creative</strong> <strong>Data</strong> <strong>Mining</strong> | Final project documentation<br />

15


Urban Perception: A deep learning approach<br />

Student: Jack Kenning


mation process and ends with a visualization step, making <strong>the</strong> <strong>results</strong> visible. From<br />

<strong>Data</strong> Set (Energy and Social Performance <strong>of</strong> Urban Morphologies) collected b<br />

tion Architecture Motivation chair <strong>of</strong> ETH Zürich, <strong>the</strong>re were survey responses <strong>from</strong> specific path-p<br />

ty <strong>of</strong> Zürich. The survey asked respondents at each path-point a series <strong>of</strong> questions, su<br />

The creative data mining course leads students through <strong>the</strong> process <strong>of</strong> data mining with a specific questionbased<br />

is this approach. location?" The collection, or "How cleaning, ordered selection do <strong>of</strong> you relevant find data this is location?". followed by a One transformation popular app<br />

eautiful<br />

process and ends with a visualization step, making <strong>the</strong> <strong>results</strong> visible. From <strong>the</strong> ESUM <strong>Data</strong> Set (Energy<br />

mpare <strong>the</strong> subjective survey responses to an objective measure that could guide a<br />

and Social Performance <strong>of</strong> Urban Morphologies) collected by <strong>the</strong> Information Architecture chair <strong>of</strong> ETH<br />

rban Zürich, design <strong>the</strong>re process. were survey responses <strong>from</strong> specific path-points in <strong>the</strong> city <strong>of</strong> Zürich. The survey asked<br />

respondents at each path-point a series <strong>of</strong> questions, such as: “How beautiful is this location?” or<br />

“How ordered do you find this location?”. One popular approach is to compare <strong>the</strong> subjective survey<br />

responses to an objective measure that could guide a more global urban design process.<br />

sis & Research Questions<br />

Hypo<strong>the</strong>sis & Research Questions<br />

an element that seemed to be under evaluated was <strong>the</strong> presence <strong>of</strong> vegetation and<br />

e can<br />

One<br />

compare<br />

urban element<br />

this<br />

that<br />

to<br />

seemed<br />

how beautiful,<br />

to under<br />

interesting<br />

evaluated was<br />

or<br />

<strong>the</strong><br />

ordered<br />

presence <strong>of</strong><br />

that<br />

vegetation<br />

people<br />

and<br />

found<br />

green<br />

a sp<br />

areas. We can compare this to how beautiful, interesting or ordered that people found a specific point.<br />

“Trees make an urban situation more beautiful.” Do <strong>the</strong>y also make it more interesting and/or more<br />

ordered?<br />

ake an urban situation more beautiful."<br />

also make it more interesting and/or more ordered?<br />

Approach & Methods<br />

The ESUM Project <strong>Data</strong> was a valuable source, but it was necessary to collect o<strong>the</strong>r data to evaluate<br />

<strong>the</strong> presence <strong>of</strong> vegetation in an urban context.<br />

ch & Methods<br />

M Project <strong>Data</strong> was a valuable source, but it was necessary to collect o<strong>the</strong>r data to<br />

e <strong>the</strong> presence <strong>of</strong> vegetation in an urban context.<br />

Figure 1: <strong>Data</strong> Sources<br />

<strong>Data</strong> <strong>Mining</strong> | <strong>Spring</strong> <strong>2017</strong> | Final Projects<br />

18<br />

New Methods in <strong>Creative</strong> <strong>Data</strong> <strong>Mining</strong> | Final project documentation


The presence <strong>of</strong> vegetation is not something that is easily measured, but a logical approach is to use<br />

The<br />

an<br />

presence<br />

image-based<br />

<strong>of</strong> vegetation<br />

method.<br />

is<br />

This<br />

not something<br />

also makes<br />

that<br />

sense<br />

is easily<br />

as <strong>the</strong><br />

measured,<br />

research<br />

but<br />

question<br />

a logical<br />

directly<br />

approach<br />

relates<br />

is to use<br />

to an<br />

an<br />

image-based method. This also makes sense as <strong>the</strong> research question directly relates to an individual’s<br />

individual's perception. However, it is not possible to simply identify vegetation by <strong>the</strong> color <strong>of</strong> a<br />

perception. However, it is not possible to simply identify vegetation by <strong>the</strong> color <strong>of</strong> a pixel for example,<br />

because pixel The presence for many example, things <strong>of</strong> vegetation because in an urban is many not context something things are green an that urban is and easily context this measured, would are result green but in a and a logical large this approach error. would result is to in use a<br />

large an image-based error. method. This also makes sense as <strong>the</strong> research question directly relates to an<br />

Deep<br />

Deep individual's<br />

learning<br />

learning perception.<br />

based convolution<br />

based on However, convolution it<br />

neural<br />

is neural not<br />

networks<br />

possible<br />

makes<br />

networks to makes simply<br />

it possible<br />

it identify<br />

to<br />

possible vegetation<br />

quite easily<br />

to quite easily by<br />

accomplish<br />

<strong>the</strong> accomplish color <strong>of</strong><br />

this<br />

a<br />

task. Through an established framework, we can train <strong>the</strong> framework to recognize certain features in<br />

this pixel task. for example, Through because an established many things framework, in an urban we can context train <strong>the</strong> are green framework and this to recognize would result certain<br />

a<br />

an image - in this case vegetation (trees, bushes, grass, etc.).<br />

features large error. in an image - in this case vegetation (trees, bushes, grass, etc.).<br />

Caffe is a framework developed by <strong>the</strong> Berkeley Vision and Learning Centre and was used in this<br />

Caffe Deep is learning a framework based on developed convolution by <strong>the</strong> neural Berkeley networks Vision makes and Learning it possible Centre to quite and easily was used accomplish<br />

example. The advantage is not requiring much coding knowledge to implement <strong>the</strong> framework. in Below this<br />

is example. this<br />

<strong>the</strong> proposed<br />

task. The Through<br />

workflow. advantage an established not requiring framework, much we coding can train knowledge <strong>the</strong> framework to implement to recognize <strong>the</strong> framework. certain<br />

Below features is <strong>the</strong> in an proposed image - in workflow. this case vegetation (trees, bushes, grass, etc.).<br />

Caffe is a framework developed by <strong>the</strong> Berkeley Vision and Learning Centre and was used in this<br />

example. The advantage is not requiring much coding knowledge to implement <strong>the</strong> framework.<br />

Below is <strong>the</strong> proposed workflow.<br />

The extractor can learn to identify certain features Figure 2: corresponding Workflow to what we are trying to identify <strong>from</strong><br />

<strong>the</strong> training set that we will use and <strong>the</strong>n <strong>the</strong> classifier makes it possible to identify vegetation on any<br />

The extractor can learn to identify certain features corresponding to what we are trying to identify<br />

urban image. These can both be set up in Caffe.<br />

<strong>from</strong> <strong>the</strong> training set that we will use and <strong>the</strong>n <strong>the</strong> classifier makes it possible to identify vegetation<br />

on any urban image. These can both be set Figure up in 2: Caffe. Workflow<br />

The extractor can learn to identify certain features corresponding to what we are trying to identify<br />

<strong>from</strong> <strong>the</strong> training set that we will use and <strong>the</strong>n <strong>the</strong> classifier makes it possible to identify vegetation<br />

on any urban image. These can both be set up in Caffe.<br />

Figure 3: Caffe setup<br />

<strong>Creative</strong> <strong>Data</strong> <strong>Mining</strong> | <strong>Spring</strong> <strong>2017</strong> | Final Projects<br />

Figure 3: Caffe setup<br />

New Methods in <strong>Creative</strong> <strong>Data</strong> <strong>Mining</strong> | Final project documentation<br />

<strong>Creative</strong> <strong>Data</strong> <strong>Mining</strong> | <strong>Spring</strong> <strong>2017</strong> | Final Projects<br />

19


In this case, <strong>the</strong> project relied largely on SegNet, a modified version <strong>of</strong> Caffe, developed by Cambridge<br />

University. Hand-labeled images are used to generate a predictive model that can segment an image<br />

into different urban elements such as buildings, roads, pavements, trees, vehicles etc. There are 12<br />

different In this categories case, <strong>the</strong> that project SegNet relied can largely be trained on to SegNet, identify a <strong>from</strong> modified <strong>the</strong> labeled version set, <strong>the</strong>se <strong>of</strong> Caffe, were developed <strong>the</strong> categories by<br />

used for <strong>the</strong> analysis, more specifically <strong>the</strong> tree elements.<br />

Cambridge University. Hand-labeled images are used to generate a predictive model that can<br />

segment an image into different urban elements such as buildings, roads, pavements, trees, vehicles<br />

etc. There are 12 different categories that SegNet can be trained to identify <strong>from</strong> <strong>the</strong> labeled set,<br />

<strong>the</strong>se were <strong>the</strong> categories used for <strong>the</strong> analysis, more specifically <strong>the</strong> tree elements.<br />

Figure 4: SegNet, Cambridge University<br />

The pre-trained model runs <strong>the</strong> classification pixel by pixel based on a multitude <strong>of</strong> input factors,<br />

but ultimately outputs a segmented image that can <strong>the</strong>n be used as an objective source <strong>of</strong> urban<br />

The<br />

data.<br />

pre-trained model runs <strong>the</strong> classification pixel by pixel based on a multitude <strong>of</strong> input factors, but<br />

ultimately outputs a segmented image that can <strong>the</strong>n be used as an objective source <strong>of</strong> urban data.<br />

20<br />

New Methods in <strong>Creative</strong> <strong>Data</strong> <strong>Mining</strong> | Final project documentation<br />

<strong>Creative</strong> <strong>Data</strong> <strong>Mining</strong> | <strong>Spring</strong> <strong>2017</strong> | Final Projects


Results & Discussion<br />

Results & Discussion<br />

In order Results to take & Discussion into account peripheral vision as someone walks through a neighborhood, we ran <strong>the</strong><br />

segmentation In order to on take three into images, account taken peripheral at eye-level vision <strong>from</strong> as someone a vertical walks position. through a neighborhood, we ran<br />

<strong>the</strong> In order segmentation to take into on account three images, peripheral taken vision at eye-level as someone <strong>from</strong> walks a vertical through position. a neighborhood, we ran<br />

<strong>the</strong> segmentation on three images, taken at eye-level <strong>from</strong> a vertical position.<br />

Figure 5: Segmented images <strong>from</strong> Zürich survey path<br />

Figure 5: Segmented images <strong>from</strong> Zürich survey path<br />

After this segmentation was done, it was relatively simple to read <strong>the</strong> number <strong>of</strong> pixels <strong>of</strong> each color<br />

with After a python this segmentation script and establish was done, <strong>the</strong> percentage it was relatively <strong>of</strong> image simple corresponding to read <strong>the</strong> number to a given <strong>of</strong> pixels category. <strong>of</strong> each The color<br />

first<br />

After with assumption a this python segmentation<br />

was script to plot and was<br />

<strong>the</strong> establish done,<br />

percentage <strong>the</strong> it was percentage <strong>of</strong><br />

relatively<br />

vegetation <strong>of</strong> simple image to different<br />

to corresponding read <strong>the</strong><br />

survey<br />

number<br />

responses to <strong>of</strong> a given pixels<br />

and category. <strong>of</strong><br />

to<br />

each color The<br />

analyze with first <strong>the</strong> assumption a python output: script was and to establish plot <strong>the</strong> percentage <strong>the</strong> percentage <strong>of</strong> vegetation <strong>of</strong> image corresponding to different survey to a given responses category. and The to<br />

first analyze assumption <strong>the</strong> output: was to plot <strong>the</strong> percentage <strong>of</strong> vegetation to different survey responses and to<br />

analyze <strong>the</strong> output:<br />

Figure 6: Survey Responses plotted against <strong>the</strong> segmented tree percentage<br />

Figure 6: Survey Responses plotted against <strong>the</strong> segmented tree percentage<br />

<strong>Creative</strong> <strong>Data</strong> <strong>Mining</strong> | <strong>Spring</strong> <strong>2017</strong> | Final Projects<br />

<strong>Creative</strong> <strong>Data</strong> <strong>Mining</strong> New | <strong>Spring</strong> Methods <strong>2017</strong> in | <strong>Creative</strong> Final Projects <strong>Data</strong> <strong>Mining</strong> | Final project documentation<br />

21


The The output output had had a a general coherence but but a lot lot <strong>of</strong> <strong>of</strong> specific cases which didn't didn’t fit to <strong>the</strong> trend <strong>of</strong> a place<br />

being more beautiful with a larger presence <strong>of</strong> vegetation. For this reason, it made more sense to move<br />

being more beautiful with a larger presence <strong>of</strong> vegetation. For this reason, it made more sense to<br />

back<br />

The<br />

towards<br />

move output back had<br />

a more<br />

towards a general<br />

general<br />

a more coherence<br />

analysis<br />

general but<br />

where<br />

analysis a lot<br />

all<br />

where <strong>of</strong><br />

<strong>the</strong><br />

specific<br />

different<br />

all <strong>the</strong> cases<br />

categories<br />

different which didn't<br />

could<br />

categories fit<br />

be<br />

to <strong>the</strong><br />

seen.<br />

could trend<br />

The<br />

be seen. <strong>of</strong><br />

expected<br />

a place<br />

output would be to see an increase in <strong>the</strong> “vegetation” part <strong>of</strong> <strong>the</strong> image in places that <strong>the</strong> survey<br />

The<br />

expected being more output beautiful would with a to larger see an presence increase <strong>of</strong> in vegetation. <strong>the</strong> "vegetation" For this part reason, <strong>of</strong> <strong>the</strong> it made image more in places sense that to<br />

respondents found more beautiful. As <strong>the</strong> graph below shows, that was not necessarily <strong>the</strong> case.<br />

<strong>the</strong> move survey back respondents towards a more found general more analysis beautiful. where As <strong>the</strong> all graph <strong>the</strong> different below shows, categories that was could not be necessarily seen. The<br />

<strong>the</strong> expected case. output would be to see an increase in <strong>the</strong> "vegetation" part <strong>of</strong> <strong>the</strong> image in places that<br />

<strong>the</strong> survey respondents found more beautiful. As <strong>the</strong> graph below shows, that was not necessarily<br />

<strong>the</strong> case.<br />

Figure 7: Relative importance <strong>of</strong> urban elements plotted to survey beauty index<br />

The stacked surfaces<br />

Figure<br />

represent<br />

7: Relative importance<br />

<strong>the</strong> relative<br />

<strong>of</strong> urban<br />

presence<br />

elements<br />

<strong>of</strong> <strong>the</strong><br />

plotted<br />

street,<br />

to survey<br />

<strong>the</strong> buildings,<br />

beauty index<br />

<strong>the</strong> trees and <strong>the</strong><br />

The sky, stacked but it doesn't surfaces appear represent that <strong>the</strong> any relative <strong>of</strong> <strong>the</strong>se presence have a direct <strong>of</strong> <strong>the</strong> influence street, <strong>the</strong> on buildings, how beautiful <strong>the</strong> trees people and found <strong>the</strong> sky, a<br />

but The given it doesn’t stacked location. appear surfaces To be that sure represent any <strong>of</strong> <strong>the</strong> <strong>of</strong> <strong>the</strong>se <strong>the</strong> output, relative have <strong>the</strong> a presence same direct graph influence <strong>of</strong> <strong>the</strong> was street, on generated how <strong>the</strong> beautiful buildings, relative people to <strong>the</strong> <strong>the</strong> trees found questions and a given <strong>the</strong> <strong>of</strong><br />

location. sky, "Like-Dislike", but To it doesn't be sure "Interest-Boring", appear <strong>of</strong> <strong>the</strong> output, that any "Order-Disorder".<br />

<strong>the</strong> <strong>of</strong> <strong>the</strong>se same have graph a direct was generated influence relative on how to beautiful <strong>the</strong> questions people <strong>of</strong> found “Like- a<br />

Dislike”, given location. “Interest-Boring”, To be sure “Order-Disorder”.<br />

<strong>of</strong> <strong>the</strong> output, <strong>the</strong> same graph was generated relative to <strong>the</strong> questions <strong>of</strong><br />

"Like-Dislike", "Interest-Boring", "Order-Disorder".<br />

Figure 8: Relative importance for different survey questions<br />

Figure 8: Relative importance for different survey questions<br />

<strong>Creative</strong> <strong>Data</strong> <strong>Mining</strong> | <strong>Spring</strong> <strong>2017</strong> | Final Projects<br />

22<br />

<strong>Creative</strong> <strong>Data</strong> New <strong>Mining</strong> Methods | <strong>Spring</strong> <strong>Creative</strong> <strong>2017</strong> <strong>Data</strong> | Final <strong>Mining</strong> Projects | Final project documentation


Conclusions<br />

The hypo<strong>the</strong>sis and research questions seem to indicate that <strong>the</strong> factors leading to a place being<br />

described as “beautiful”, “ordered”, or “likeable” are more complex than just <strong>the</strong> presence <strong>of</strong> vegetation<br />

in a given place. Multiple factors may have influenced <strong>the</strong>se <strong>results</strong>, such as <strong>the</strong> specific location <strong>of</strong><br />

<strong>the</strong> photos compared to <strong>the</strong> position <strong>of</strong> <strong>the</strong> survey respondents which were not always very precise.<br />

There were also minor (and major) modifications to <strong>the</strong> urban environment with one tree having been<br />

chopped down between <strong>the</strong> survey and <strong>the</strong> photo analysis. The sample-size may also have induced<br />

error in <strong>the</strong>se findings, as only around 30 survey responses were used. With a larger data set it may be<br />

possible to extract more meaningful <strong>results</strong>.<br />

It seems reasonable to imagine that <strong>the</strong> presence <strong>of</strong> trees makes a place-point more beautiful but this<br />

does not directly influence how beautiful a place is. The characteristic appears to be a lot more complex<br />

that solely quantitative. However, <strong>the</strong> first data used <strong>from</strong> <strong>the</strong> segmented image was an average <strong>of</strong> <strong>the</strong><br />

number <strong>of</strong> pixels representing vegetation in <strong>the</strong> photo. This was considered quite a poor measurement<br />

in relation to <strong>the</strong> o<strong>the</strong>r information contained in a segmented image, so in <strong>the</strong> second visualisation <strong>the</strong><br />

relative importance <strong>of</strong> segments is taken into account. For fur<strong>the</strong>r analysis, maybe an indicator such as<br />

<strong>the</strong> distribution across <strong>the</strong> image can be looked at.<br />

References<br />

Caffe - Berkeley Vision and Learning Centre http://caffe.berkeleyvision.org/<br />

SegNet - Cambridge University http://mi.eng.cam.ac.uk/projects/segnet<br />

NVIDIA Caffe Course http://on-demand.gputechconf.com/gtc/2015/webinar/deep-learning-course/<br />

gettingstarted-with-caffe.pdf<br />

ESUM Project - ETH Zürich http://www.ia.arch.ethz.ch/esum/<br />

New Methods in <strong>Creative</strong> <strong>Data</strong> <strong>Mining</strong> | Final project documentation<br />

23


Street Cross Section<br />

Student: Hong Liang


Motivation<br />

Street cross section design is an import part urban planning. In most cases, it depends on <strong>the</strong> road<br />

grade, and <strong>the</strong> property and function <strong>of</strong> <strong>the</strong> road. Once <strong>the</strong> road grade and capacity are confirmed,<br />

<strong>the</strong> type <strong>of</strong> street cross section is roughly decided. Street cross section <strong>of</strong>fers functional services to<br />

pedestrian, on <strong>the</strong> o<strong>the</strong>r hand, different fraction <strong>of</strong> various street cross section elements may create<br />

different atmosphere, which can also effect pedestrian’s feeling and behavior. But how does it work?<br />

In this project, with <strong>the</strong> ESUM data set and some suitable analysis methods, <strong>the</strong> relationship between<br />

street cross section and pedestrian’s reaction will be explored and discussed.<br />

Hypo<strong>the</strong>sis & Research Question(s)<br />

The main question <strong>of</strong> this project is to explore <strong>the</strong> effect <strong>of</strong> street cross section on pedestrian. The<br />

elements <strong>of</strong> street cross section mainly include drive lane, parking lane, bike lane and side walk. So <strong>the</strong><br />

more specific question is: which element plays a more important roll on effecting pedestrian’s feeling?<br />

For example, parking lane could be a negative factor. More wide <strong>the</strong> parking lane is, it could make<br />

pedestrian feel more insecure and chaotic.<br />

Approach & Methods<br />

<strong>Data</strong> Collection and Preparation<br />

In order to describe <strong>the</strong> street cross section along <strong>the</strong> path, I measured every street cross section<br />

element on 14 street points on google map, and compared <strong>the</strong>m with <strong>the</strong> photos. The collected data<br />

and street cross section diagrams are as follow:<br />

26<br />

New Methods in <strong>Creative</strong> <strong>Data</strong> <strong>Mining</strong> | Final project documentation


Table 1 street cross section data <strong>of</strong> 14 points<br />

point street width sidewalk width sidewalk rate parking lane width parking lane rate trees crossing way<br />

1 12 3 0.2500 4.2 0.3500 1 none<br />

2 12 3 0.2500 4.2 0.3500 1 none<br />

3 4.2 1.2 0.2857 0 0.0000 1 none<br />

4 12.6 3.6 0.2857 0 0.0000 1 none<br />

5 12.4 4.3 0.3468 0 0.0000 0 zebra crossing<br />

6 13.5 4.4 0.3259 2.1 0.1556 0 zebra crossing<br />

7 11.3 3.6 0.3186 4.2 0.3717 1 none<br />

8 16.8 3 0.1786 0 0.0000 1 zebra crossing<br />

9 13.8 3.6 0.2609 4.2 0.3043 0 none<br />

10 11.7 3.6 0.3077 2.1 0.1795 1 zebra crossing<br />

11 18.5 4 0.2162 0 0.0000 0 zebra crossing<br />

12 19.3 3.6 0.1865 0 0.0000 0 zebra crossing<br />

13 15 3.6 0.2400 2.1 0.1400 0 zebra crossing<br />

14 13.8 3.6 0.2609 4.2 0.3043 0 none<br />

Figure 1 street cross section diagram <strong>of</strong> 14 points<br />

As shown in <strong>the</strong> diagram above, due to <strong>the</strong> different street grade and function, <strong>the</strong> street cross<br />

section elements at 14 points are different. In particular, street width and parking lane are <strong>the</strong><br />

As primary shown in differences <strong>the</strong> diagram among above, 14 due points. to <strong>the</strong> At different <strong>the</strong> same street time, grade side walk and function, is <strong>the</strong> most <strong>the</strong> important street cross part section <strong>of</strong><br />

elements street for at 14 pedestrian. points are Therefore, different. In I particular, chose street street width width (width and parking between lane two are buildings), <strong>the</strong> primary parking differences<br />

among 14 points. At <strong>the</strong> same time, side walk is <strong>the</strong> most important part <strong>of</strong> street for pedestrian.<br />

lane fraction (parking lane width / street width) and side walk fraction (side walk width / street<br />

Therefore, I chose street width (width between two buildings), parking lane fraction (parking lane width<br />

/ street<br />

width)<br />

width)<br />

to describe<br />

and side<br />

<strong>the</strong><br />

walk<br />

property<br />

fraction<br />

<strong>of</strong><br />

(side<br />

street<br />

walk<br />

cross<br />

width<br />

section.<br />

/ street<br />

Some<br />

width)<br />

binary<br />

to describe<br />

variables,<br />

<strong>the</strong><br />

such<br />

property<br />

as whe<strong>the</strong>r<br />

<strong>of</strong> street<br />

cross <strong>the</strong>re section. are trees Some <strong>of</strong> zebra binary crossings variables, are such also as taken whe<strong>the</strong>r into <strong>the</strong>re consideration. are trees <strong>of</strong> zebra crossings are also taken<br />

into consideration.<br />

To describe pedestrian’s feeling, I chose 3 questions <strong>from</strong> <strong>the</strong> survey result <strong>of</strong> ESUM: like <strong>of</strong><br />

To dislike, describe ordered pedestrian’s <strong>of</strong> chaotic, feeling, secure I chose or insecure. 3 questions ESUM <strong>from</strong> experiment <strong>the</strong> survey was result taken <strong>of</strong> by ESUM: 37 participants. like <strong>of</strong> dislike,<br />

ordered In order <strong>of</strong> to chaotic, compare secure <strong>the</strong> <strong>results</strong> or insecure. with <strong>the</strong> ESUM street experiment cross section’s was taken property by 37 at 14 participants. points more In easily, order to<br />

compare first, I calculate <strong>the</strong> <strong>results</strong> <strong>the</strong> with average <strong>the</strong> street value cross for 3 section’s questions property at 14 points. at 14 points more easily, first, I calculate <strong>the</strong><br />

average value for 3 questions at 14 points.<br />

<strong>Creative</strong> <strong>Data</strong> <strong>Mining</strong> | <strong>Spring</strong> <strong>2017</strong> | Final Projects 2<br />

New Methods in <strong>Creative</strong> <strong>Data</strong> <strong>Mining</strong> | Final project documentation<br />

27


Figure 2 survey result<br />

As shown in <strong>the</strong> diagrams above, <strong>the</strong> <strong>results</strong> <strong>of</strong> 3 survey questions present similar tendency. It<br />

As shown can be in also <strong>the</strong> concluded diagrams <strong>from</strong> above, <strong>the</strong> <strong>the</strong> analysis, <strong>results</strong> that <strong>of</strong> 3 <strong>the</strong> survey deviation questions at some present points similar (e.g. point tendency. 7) is It can<br />

be also relatively concluded small, <strong>from</strong> which <strong>the</strong> means analysis, <strong>the</strong> that average <strong>the</strong> deviation value is more at some credible, points while (e.g. point <strong>the</strong> feelings 7) is relatively <strong>of</strong> small,<br />

which pedestrian means <strong>the</strong> at average point 5 and value 6 are is more more credible, disperse. while <strong>the</strong> feelings <strong>of</strong> pedestrian at point 5 and 6 are<br />

more disperse.<br />

<strong>Data</strong> Analysis Method<br />

<strong>Data</strong> Analysis Method<br />

<strong>Data</strong> analysis is divided into 4 steps.<br />

<strong>Data</strong> analysis 1. Average is divided value into 4 steps.<br />

In first step, I compare <strong>the</strong> calculated 3 average values <strong>of</strong> survey <strong>results</strong> with 3 street cross<br />

section properties (street width, parking lane fraction and side walk fraction) in turn, to get<br />

1. Average value<br />

<strong>the</strong> first impression <strong>of</strong> <strong>the</strong> relation ship between <strong>the</strong>m.<br />

In first step, 2. Cluster I compare <strong>the</strong> calculated 3 average values <strong>of</strong> survey <strong>results</strong> with 3 street cross section<br />

properties (street width, parking lane fraction and side walk fraction) in turn, to get <strong>the</strong> first impression<br />

<strong>of</strong> <strong>the</strong> relation I choose ship k-mean between as <strong>the</strong>m. cluster method to analyze <strong>the</strong> relationship between 3 average<br />

values <strong>of</strong> survey <strong>results</strong> with 3 street cross section properties. In this step, some semiquantitative<br />

<strong>results</strong> can be 2. Cluster<br />

concluded.<br />

I choose 3. k-mean Multi-variable as <strong>the</strong> linear cluster regression method to analyze <strong>the</strong> relationship between 3 average values <strong>of</strong><br />

survey <strong>results</strong> with 3 street cross section properties. In this step, some semiquantitative <strong>results</strong> can be<br />

concluded. In multi-variable regression, two more binary variables (trees and zebra crossing) are also<br />

taken into consideration. For each survey result, we will get a formula consisting <strong>of</strong> 5 street<br />

3. Multi-variable cross section linear variables. regression And according to its coefficient, we can get which variable’s effect is<br />

more significate.<br />

In multi-variable regression, two more binary variables (trees and zebra crossing) are also taken into<br />

consideration. 4. Simple For each linear survey regression result, we will get a formula consisting <strong>of</strong> 5 street cross section variables.<br />

And according to its coefficient, we can get which variable’s effect is more significate.<br />

Once we get <strong>the</strong> most import factor for each survey result, we can do <strong>the</strong> simple linear<br />

4. Simple regression linear to regression test its significance and credibility again.<br />

Once we get <strong>the</strong> most import factor for each survey result, we can do <strong>the</strong> simple linear regression to<br />

test its significance and credibility again.<br />

<strong>Creative</strong> <strong>Data</strong> <strong>Mining</strong> | <strong>Spring</strong> <strong>2017</strong> | Final Projects 3<br />

28<br />

New Methods in <strong>Creative</strong> <strong>Data</strong> <strong>Mining</strong> | Final project documentation


Results & Discussion<br />

Results & Discussion<br />

1. Average value<br />

Figure 3 comparison between survey result and street cross section property<br />

According to <strong>the</strong> diagrams, it is more or less clear that <strong>the</strong>re is correspondent relationship between side<br />

walk fraction and pedestrian’s feeling in all 3 survey <strong>results</strong> (like, ordered and secure). The biggest<br />

difference occurs at point 5. But point 5 is an intersection, and <strong>the</strong> street cross data can only describe on<br />

According to <strong>the</strong> diagrams, it is more or less clear that <strong>the</strong>re is correspondent relationship between<br />

street. That could be <strong>the</strong> reason to explain that big gap. Additionally, <strong>the</strong> deviation <strong>of</strong> survey <strong>results</strong> at<br />

side walk fraction and pedestrian’s feeling in all 3 survey <strong>results</strong> (like, ordered and secure). The biggest<br />

point 5 is relatively big, it could also result in this difference.<br />

difference occurs at point 5. But point 5 is an intersection, and <strong>the</strong> street cross data can only describe<br />

on street. That could be <strong>the</strong> reason to explain that big gap. Additionally, <strong>the</strong> deviation <strong>of</strong> survey <strong>results</strong><br />

at point 5 is relatively big, it could also result in this difference.<br />

<strong>Creative</strong> <strong>Data</strong> <strong>Mining</strong> | <strong>Spring</strong> <strong>2017</strong> | Final Projects 4<br />

New Methods in <strong>Creative</strong> <strong>Data</strong> <strong>Mining</strong> | Final project documentation<br />

29


2. Cluster<br />

Figure 4 cluster analysis<br />

Due to <strong>the</strong> limitation <strong>of</strong> <strong>the</strong> amount <strong>of</strong> data set, <strong>the</strong> <strong>results</strong> <strong>of</strong> cluster analysis are not very<br />

significant. But we can still get <strong>the</strong> similar semi-quantitative result that low side walk fraction<br />

will result in more negative feeling, while <strong>the</strong> pedestrian is more likely to have a positive<br />

Figure<br />

evaluation<br />

4 cluster<br />

for those<br />

analysis<br />

points which have high side walk fraction.<br />

Due In additional, to <strong>the</strong> limitation <strong>the</strong>re is <strong>of</strong> also <strong>the</strong> amount some obvious <strong>of</strong> data cluster set, <strong>the</strong> between <strong>results</strong> <strong>of</strong> parking cluster lane analysis fraction are not and very secure. significant. It<br />

But<br />

seems<br />

we<br />

that<br />

can<br />

higher<br />

still get<br />

parking<br />

<strong>the</strong> similar<br />

lane<br />

semi-quantitative<br />

fraction leads to<br />

result<br />

an insecure<br />

that low<br />

feeling<br />

side walk<br />

for pedestrian.<br />

fraction will result in more<br />

negative feeling, while <strong>the</strong> pedestrian is more likely to have a positive evaluation for those points<br />

which If <strong>the</strong> experiment have high side could walk have fraction. more participants and cover more streets, cluster analysis will<br />

have more credible conclusion.<br />

In additional, <strong>the</strong>re is also some obvious cluster between parking lane fraction and secure. It seems<br />

that higher parking lane fraction leads to an insecure feeling for pedestrian. If <strong>the</strong> experiment could<br />

have more participants and cover more streets, cluster analysis will have more credible conclusion.<br />

<strong>Creative</strong> <strong>Data</strong> <strong>Mining</strong> | <strong>Spring</strong> <strong>2017</strong> | Final Projects 5<br />

30<br />

New Methods in <strong>Creative</strong> <strong>Data</strong> <strong>Mining</strong> | Final project documentation


3. Multi-variable linear regression<br />

Like =<br />

– 0.046 * street width<br />

+ 0.312 * sidewalk fraction<br />

– 1.350 * parking lane fraction<br />

+ 0.259 * trees<br />

– 0.600 * zebra crossing<br />

Figure 5 multi-variable linear regression for 'like or dislike'<br />

Ordered =<br />

+ 0.046 * street width<br />

+ 5.898 * sidewalk fraction<br />

– 0.293 * parking lane fraction<br />

+ 0.383 * trees<br />

– 0.503 * zebra crossing<br />

Figure 6 multi-variable linear regression for ‘ordered or chaotic’<br />

<strong>Creative</strong> <strong>Data</strong> <strong>Mining</strong> | <strong>Spring</strong> <strong>2017</strong> | Final Projects 6<br />

New Methods in <strong>Creative</strong> <strong>Data</strong> <strong>Mining</strong> | Final project documentation<br />

31


Secure =<br />

– 0.008 * street width<br />

+ 2.269 * sidewalk fraction<br />

Secure =<br />

0.720 parking lane fraction<br />

– 0.008 * street width<br />

0.208 trees<br />

+ 2.269 * sidewalk fraction<br />

0.521 zebra crossing<br />

– 0.720 * parking lane fraction<br />

+ 0.208 * trees<br />

– 0.521 * zebra crossing<br />

Figure 7 multi-variable linear regression for ‘secure or insecure’<br />

Table 2 coefficients <strong>of</strong> multi-variable linear regression<br />

Figure 7 multi-variable linear regression for ‘secure or insecure’<br />

street width sidewalk fraction parking lane fraction trees zebra crossing<br />

like -0.046 0.312 -1.350 0.259 -0.600<br />

Table 2 coefficients <strong>of</strong> multi-variable linear regression<br />

ordered 0.046 5.898 -0.293 0.383 -0.503<br />

secure street width -0.008 sidewalk fraction 2.269 parking lane fraction -0.720 trees0.208 zebra crossing -0.521<br />

From<br />

like -0.046 0.312 -1.350 0.259 -0.600<br />

<strong>the</strong> table and diagrams, we can get <strong>the</strong> conclusion that side walk fraction and trees have<br />

and<br />

ordered 0.046 5.898 -0.293 0.383 -0.503<br />

positive effect on pedestrian, because all <strong>the</strong> coefficients are positive. On <strong>the</strong> o<strong>the</strong>r hand,<br />

parking<br />

secure -0.008 2.269 -0.720 0.208 -0.521<br />

lane fraction and zebra crossing effect pedestrian in a more negative way.<br />

From <strong>the</strong> table and diagrams, we can get <strong>the</strong> conclusion that side walk fraction and trees have and<br />

positive From <strong>the</strong> effect table on and pedestrian, diagrams, because we can get all <strong>the</strong> coefficients conclusion are that positive. side walk On fraction <strong>the</strong> o<strong>the</strong>r and hand, trees parking have<br />

lane and positive fraction and effect zebra on pedestrian, crossing effect because pedestrian all <strong>the</strong> in coefficients a more negative are positive. way. On <strong>the</strong> o<strong>the</strong>r hand,<br />

parking<br />

4. Simple<br />

lane fraction<br />

linear regression<br />

and zebra crossing effect pedestrian in a more negative way.<br />

4. Simple linear regression<br />

Figure 8 simple linear regression<br />

Figure 8 simple linear regression<br />

<strong>Creative</strong> <strong>Data</strong> <strong>Mining</strong> | <strong>Spring</strong> <strong>2017</strong> | Final Projects 7<br />

32<br />

New Methods in <strong>Creative</strong> <strong>Data</strong> <strong>Mining</strong> | Final project documentation


From last analysis, we find <strong>the</strong> most important factor for each question, and test it with simple linear<br />

regression. According to <strong>the</strong> diagrams above, <strong>the</strong> negative effect <strong>of</strong> parking lane fraction on ‘like or<br />

dislike’ is not very significant. But <strong>the</strong> positive effect <strong>of</strong> side walk fraction is still credible.<br />

Conclusions<br />

After all <strong>the</strong>se analysis, we can now answer <strong>the</strong> questions at <strong>the</strong> beginning. The different elements<br />

<strong>of</strong> <strong>the</strong> street cross section effect pedestrian in a different way. We can find that parking lane fraction<br />

has a negative effect on pedestrian, while side walk fraction affect pedestrian in a more significantly<br />

positive way.<br />

But to get a more meaningful and credible <strong>results</strong>, we still need more data. Additionally, <strong>the</strong> binary<br />

variable should be avoided in multi-variable regression. For example, ‘trees’ variable could be<br />

replaced by green area. Or <strong>the</strong>re could be some o<strong>the</strong>r methods more suitable for binary variable<br />

analysis. That’s could be <strong>the</strong> fur<strong>the</strong>r step <strong>of</strong> this project.<br />

References<br />

Marc Schlossberg, John Rowell, Davd Amos, and Kelly Sanford, Rethinking Streets:<br />

An Evidence-Based Guide to 25 Complete Street Transformations, 2013.<br />

New Methods in <strong>Creative</strong> <strong>Data</strong> <strong>Mining</strong> | Final project documentation<br />

33


34<br />

New Methods in <strong>Creative</strong> <strong>Data</strong> <strong>Mining</strong> | Final project documentation


Security Analysis<br />

Student: Gong Chen & Zhonghau Dai<br />

New Methods in <strong>Creative</strong> <strong>Data</strong> <strong>Mining</strong> | Final project documentation<br />

35


Motivation<br />

Security Analysis<br />

Security level is one important index to evaluate urban planning. Security level can be treated as a<br />

citizens’ comprehensive subjective reflection on urban surroundings. Therefore, based on <strong>the</strong> project<br />

ESUM- Name Analyzing trade-<strong>of</strong>fs between Energy and Social performance <strong>of</strong> Urban Morphologies,<br />

various datasets are available. Among <strong>the</strong>re datasets, survey questions are <strong>of</strong> most interest, which<br />

covers twelve questions (Figure 1), including security, noise, beauty evaluation and etc. Gong Hence, Chen<br />

<strong>the</strong>re should be a best combination <strong>of</strong> <strong>the</strong>se parameters to provide us a high urban security.<br />

Zhonghao Dai<br />

Motivation<br />

Security level is one important index to evaluate urban planning. Security level can be treated as a<br />

citizens’ comprehensive subjective reflection on urban surroundings. Therefore, based on <strong>the</strong> project<br />

ESUM- Analyzing trade-<strong>of</strong>fs between Energy and Social performance <strong>of</strong> Urban Morphologies, various<br />

datasets are available. Among <strong>the</strong>re datasets, survey questions are <strong>of</strong> most interest, which covers twelve<br />

questions (Figure 1), including security, noise, beauty evaluation and etc. Hence, <strong>the</strong>re should be a best<br />

combination <strong>of</strong> <strong>the</strong>se parameters to provide us a high urban security.<br />

Figure 1: Survey questions<br />

Hypo<strong>the</strong>sis & Research Question(s)<br />

Hypo<strong>the</strong>sis<br />

Following are questions<br />

& Research<br />

conveyed<br />

Question(s)<br />

during ESUM project, dislike/like, chaotic/ordered, noisy/quiet,<br />

private/public, boring/interesting, crowded/empty, insecure/secure, ugly/beautiful, narrow/spacious,<br />

Following<br />

enclosed/open,<br />

are questions<br />

dark/light,<br />

conveyed<br />

unfamiliar/familiar,<br />

during ESUM<br />

among<br />

project,<br />

which insecure/secure<br />

dislike/like, chaotic/ordered,<br />

is response variable<br />

noisy/quiet,<br />

private/public, focused during boring/interesting, this project, all <strong>the</strong>se crowded/empty, variables are evaluated insecure/secure, -2, -1, 0, 1, ugly/beautiful, 2. narrow/spacious,<br />

enclosed/open, dark/light, unfamiliar/familiar, among which insecure/secure is response variable<br />

focused during this project, all <strong>the</strong>se variables are evaluated in -2, -1, 0, 1, 2.<br />

Based on intuition, when a place is quite, it will make people feel secure while when it goes noisy,<br />

Based people on tend intuition, to feel when upset and a place become is quite, feeling it unsecure. will make In people this case, feel one secure hypo<strong>the</strong>sis while is made when that it goes noisy,<br />

people secure/insecure tend to feel has upset a negative and (because become <strong>the</strong> feeling high level unsecure. in questionnaire In this case, means one quiet) hypo<strong>the</strong>sis relationship is made with that<br />

secure/insecure noisy/quiet a has wide a range. negative And, (because by searching <strong>the</strong> high such level relationship, in questionnaire a best range means <strong>of</strong> noise quiet) are relationship intended to be with<br />

noisy/quiet found. in a wide range. And, by searching such relationship, a best range <strong>of</strong> noise are intended to<br />

be found.<br />

<strong>Creative</strong> <strong>Data</strong> <strong>Mining</strong> | <strong>Spring</strong> <strong>2017</strong> | Final Projects 1<br />

36<br />

New Methods in <strong>Creative</strong> <strong>Data</strong> <strong>Mining</strong> | Final project documentation


Besides, ano<strong>the</strong>r hypo<strong>the</strong>sis made in project is that people’s evaluation truly reflects noise level in<br />

decibel.<br />

Approach & Methods<br />

Since dataset is continuous and labeled, regression method is applied in project. Main procedures<br />

carried in project are listed following:<br />

1. Besides, Decided ano<strong>the</strong>r wea<strong>the</strong>r hypo<strong>the</strong>sis to use made noise in in project decibel is level that people’s or in evaluation evaluation level truly in survey reflects questions, noise level i.e. in does<br />

people’s decibel. evaluation on noise level truly corresponds to decibel level<br />

2. Searching potential variables related with security, both <strong>from</strong> visualization <strong>of</strong> data and linear<br />

regression method, i.e. single variable liner regression<br />

Approach & Methods<br />

3. Since Do multivariable dataset is continuous linear regression and labeled, to regression acquire a method simple is linear applied regression in project. model Main to procedures quantify security<br />

level carried in project are listed following:<br />

4. Check 1. Decided if <strong>the</strong> above wea<strong>the</strong>r linear to use model noise can in decibel simplified level or fur<strong>the</strong>r in evaluation to get level <strong>the</strong> most in survey crucial questions, parameters i.e. does<br />

people’s evaluation on noise level truly corresponds to decibel level<br />

2. Searching potential variables related with security, both <strong>from</strong> visualization <strong>of</strong> data and linear<br />

regression method, i.e. single variable liner regression<br />

3. Do multivariable linear regression to acquire a simple linear regression model to quantify<br />

Results & Discussion<br />

security level<br />

4. Check if <strong>the</strong> above linear model can be simplified fur<strong>the</strong>r to get <strong>the</strong> most crucial parameters<br />

1. Noise in decibel level versus evaluation level Boxplot <strong>of</strong> decibel level and evaluation level are<br />

both shown below (Figure 2 and Figure 3). Although boxplot provides more information e.g. outlier,<br />

percentile, it is hart to tell relationship.<br />

Results & Discussion<br />

1. Noise in decibel level versus evaluation level<br />

Boxplot <strong>of</strong> decibel level and evaluation level are both shown below (Figure 2 and Figure 3).<br />

Although boxplot provides more information e.g. outlier, percentile, it is hart to tell relationship.<br />

Figure 2: Decibel level<br />

Figure 3: Noise level<br />

Then two boxplots are shown in Figure 4 in <strong>the</strong> same axis, and mean values <strong>of</strong> <strong>the</strong>se two data are<br />

shown in Figure 5. Important thing here is that in question survey, -2 means noisy and 2 means<br />

Then two<br />

quite,<br />

boxplots<br />

so noise<br />

are<br />

level<br />

shown<br />

is replaced<br />

in Figure<br />

with<br />

4 in<br />

its<br />

<strong>the</strong><br />

opposite<br />

same<br />

number.<br />

axis, and<br />

Basically,<br />

mean values<br />

<strong>from</strong> Figure<br />

<strong>of</strong> <strong>the</strong>se<br />

4 & 5,<br />

two data are<br />

shown in<br />

evaluation<br />

Figure 5. Important<br />

on noisy/quite<br />

thing<br />

level<br />

here<br />

corresponds<br />

is that in question<br />

to decibel<br />

survey,<br />

level obtained<br />

-2 means<br />

by sensor.<br />

noisy and 2 means quite,<br />

so noise level is replaced with its opposite number. Basically, <strong>from</strong> Figure 4 & 5, evaluation on noisy/<br />

quite level corresponds to decibel level obtained by sensor.<br />

New Methods in <strong>Creative</strong> <strong>Data</strong> <strong>Mining</strong> | Final project documentation<br />

37


Figure 4: Decibel versus noise level<br />

2. Potential variable searching<br />

2.1 Visualization<br />

2. Potential variable searching<br />

Figure 5: Mean value<br />

Following figures show mean value <strong>of</strong> 11 questions mentioned above versus security evaluation<br />

2.1 Visualization<br />

level at 14 path points (Figure 6), <strong>from</strong> which <strong>the</strong> most related variables are selected: dislike/like,<br />

Following figures show mean value <strong>of</strong> 11 questions mentioned above versus security evaluation level<br />

at 14 path points (Figure 6), <strong>from</strong> which <strong>the</strong> most related variables are selected: dislike/like, chaotic<br />

ordered, <strong>Creative</strong> noisy/quiet, <strong>Data</strong> <strong>Mining</strong> | boring/interesting, <strong>Spring</strong> <strong>2017</strong> | Final Projects crowded/empty, ugly/beautiful, which means <strong>the</strong>se variables 3<br />

determines insecure/secure level significantly.<br />

38<br />

New Methods in <strong>Creative</strong> <strong>Data</strong> <strong>Mining</strong> | Final project documentation


chaotic/ordered, noisy/quiet, boring/interesting, crowded/empty, ugly/beautiful, which means<br />

<strong>the</strong>se variables determines insecure/secure level significantly.<br />

Figure 6: Mean value <strong>of</strong> each path point<br />

Also, scatter plot methods can be employed here to show linear relationship between <strong>the</strong>se<br />

variables and security more clearly, for example shown in figure 7.<br />

Also, scatter plot methods can be employed here to show linear relationship between <strong>the</strong>se variables<br />

and security more clearly, for example shown in figure 7.<br />

<strong>Creative</strong> <strong>Data</strong> <strong>Mining</strong> | <strong>Spring</strong> <strong>2017</strong> | Final Projects 4<br />

New Methods in <strong>Creative</strong> <strong>Data</strong> <strong>Mining</strong> | Final project documentation<br />

39


Figure 7: Scatter plot<br />

2.2 Single variable linear regression<br />

2.2 Single Apart variable <strong>from</strong> mean linear value visualization regression <strong>of</strong> 14 path points, single variable linear regression is also<br />

applied here to find <strong>the</strong> coefficient between each <strong>of</strong> <strong>the</strong>se 11 survey questions and security<br />

Apart <strong>from</strong> level. mean In value <strong>the</strong>se visualization case, all datasets <strong>of</strong> 14 are path employed points, instead single <strong>of</strong> variable mean values linear only regression in section is also 2.1. applied The<br />

here to find <strong>results</strong> <strong>the</strong> coefficient toge<strong>the</strong>r with between coefficient each and <strong>of</strong> R-square <strong>the</strong>se 11 values survey are questions shown in Figure and security 8. level. In <strong>the</strong>se<br />

case, all datasets are employed instead <strong>of</strong> mean values only in section 2.1. The <strong>results</strong> toge<strong>the</strong>r with<br />

coefficient The and same R-square result values is given: are dislike/like, shown in Figure chaotic/ordered, 8. noisy/quiet, boring/interesting,<br />

crowded/empty, ugly/beautiful, <strong>the</strong>se 6 variables are most likely to determine insecure/secure<br />

The same result<br />

level.<br />

is given: dislike/like, chaotic/ordered, noisy/quiet, boring/interesting, crowded/empty,<br />

ugly/beautiful, <strong>the</strong>se 6 variables are most likely to determine insecure/secure level.<br />

Notes: Since <strong>the</strong>re are only 5*5 combinations in each subfigure, overlaps are inevitable, thus<br />

Notes: Since <strong>the</strong>re are only 5*5 combinations in each subfigure, overlaps are inevitable, thus bigger <strong>the</strong><br />

bigger <strong>the</strong> circle is, more overlaps <strong>the</strong>re are.<br />

circle is, more overlaps <strong>the</strong>re are.<br />

3. Multivariable linear regression<br />

First, linear regression with 6 dependent variables and security level as response variable is<br />

accomplished with its corresponding model:<br />

3. Multivariable linear regression<br />

First, linear<br />

Security<br />

regression<br />

level<br />

with 6 dependent variables and security level as response variable is accomplished<br />

with its corresponding model:<br />

= 0.53 + 0.0685*noise + 0.2211* order + 0.1121*beauty + 0.0975*density+<br />

0.1670*like+ 0.1473* interesting (1)<br />

Security level<br />

= 0.53 +<br />

Note:<br />

0.0685*noise<br />

Abbreviation<br />

+ 0.2211*<br />

is used<br />

order<br />

here:<br />

+<br />

like<br />

0.1121*beauty<br />

= dislike/like,<br />

+<br />

order<br />

0.0975*density+<br />

= chaotic/ordered, noise =<br />

0.1670*like+ 0.1473* interesting (1)<br />

noisy/quiet, interesting = boring/interesting, density = crowded/empty, beauty = ugly/beautiful<br />

In this case, <strong>the</strong> R-squared value is 0.323, which is satisfying since when do linear regression <strong>of</strong><br />

Note: Abbreviation is used here: like = dislike/like, order = chaotic/ordered, noise =<br />

all o<strong>the</strong>r parameters i.e. all questions except security level, <strong>the</strong> value is 0.339. The accuracy is<br />

noisy/quiet, interesting = boring/interesting, density = crowded/empty, beauty = ugly/beautiful.<br />

not destroyed while 5 parameters are dropped, making <strong>the</strong> model much simple and clear.<br />

In this case, <strong>the</strong> R-squared value is 0.323, which is satisfying since when do linear regression <strong>of</strong> all o<strong>the</strong>r<br />

parameters i.e. all questions except security level, <strong>the</strong> value is 0.339. The accuracy is not destroyed<br />

while 5 parameters are dropped, making <strong>the</strong> model much simple and clear.<br />

<strong>Creative</strong> <strong>Data</strong> <strong>Mining</strong> | <strong>Spring</strong> <strong>2017</strong> | Final Projects 5<br />

40<br />

New Methods in <strong>Creative</strong> <strong>Data</strong> <strong>Mining</strong> | Final project documentation


<strong>Creative</strong> <strong>Data</strong> <strong>Mining</strong> | <strong>Spring</strong> <strong>2017</strong> | Final Projects 6<br />

New Methods in <strong>Creative</strong> <strong>Data</strong> <strong>Mining</strong> | Final project documentation<br />

41


Figure 8: Single variable linear regression<br />

From this equation, From this noise equation, level does noise not level have does much not to have do much with security to do with compared security compared with resting with parameters, resting<br />

in next step, parameters, feasibility <strong>of</strong> in model next step, simplification feasibility <strong>of</strong> will model be checked. simplification will be checked.<br />

4. Model simplification<br />

4. Model simplification<br />

In this section, <strong>the</strong> model is check to see if <strong>the</strong> above linear model can be simplified fur<strong>the</strong>r to get<br />

<strong>the</strong> most crucial parameters, basically, R-squared mean is applied to check <strong>the</strong> accuracy <strong>of</strong><br />

In this section, model. <strong>the</strong> model In this case, is check <strong>the</strong> noise to see is dropped, if <strong>the</strong> above and <strong>the</strong> linear new model can is followed: be simplified fur<strong>the</strong>r to get <strong>the</strong><br />

most crucial parameters, basically, R-squared mean is applied to check <strong>the</strong> accuracy <strong>of</strong> model. In this<br />

case, <strong>the</strong> noise Security is dropped, level and <strong>the</strong> new model is followed:<br />

= 0.49 + 0.2285* order + 0.1312*beauty + 0.1209*density+ 0.1312*like+ 0.1469*<br />

interesting (2)<br />

Security level<br />

Meanwhile, R value drops only by 0.002 i.e. 0.625%, so this model is still effective.<br />

= 0.49 + 0.2285* order + 0.1312*beauty + 0.1209*density+ 0.1312*like+ 0.1469* interesting (2)<br />

Meanwhile, Conclusions R value drops only by 0.002 i.e. 0.625%, so this model is still effective.<br />

According to equation (1), <strong>the</strong> security level does have relation with noise, but a positive coefficient is<br />

shown in equation (2) which means people feels more secure when it is quiet. So a wrong hypo<strong>the</strong>sis is<br />

made although security and noise level are correlated, but <strong>the</strong> relationship is negative. However,<br />

Conclusions<br />

According to equation (1), <strong>the</strong> security level does have relation with noise, but a positive coefficient is<br />

shown in equation (2) which means people feels more secure when it is quiet. So a wrong hypo<strong>the</strong>sis<br />

<strong>Creative</strong> <strong>Data</strong> <strong>Mining</strong> | <strong>Spring</strong> <strong>2017</strong> | Final Projects 7<br />

is made although security and noise level are correlated, but <strong>the</strong> relationship is negative. However,<br />

compared with o<strong>the</strong>r five parameters in equation (2), noise level is negligible and order level, i.e. <strong>the</strong><br />

place is chaotic or ordered, is dominates in security level. If security level is intended to be increased,<br />

one should make <strong>the</strong> place more ordered.<br />

42<br />

New Methods in <strong>Creative</strong> <strong>Data</strong> <strong>Mining</strong> | Final project documentation


New Methods in <strong>Creative</strong> <strong>Data</strong> <strong>Mining</strong> | Final project documentation<br />

43


Chair <strong>of</strong><br />

Information<br />

Architecture

Hooray! Your file is uploaded and ready to be published.

Saved successfully!

Ooh no, something went wrong!