
Generating Music Through Image Analysis

Gregory Granito

Massachusetts Academy of Math and Science

Abstract

In order to create a bridge between visual perception and aural perception, emotions were used as a link to convert a purely visual form of art into a purely audio form. The conversion is intended to leave the emotional essence of the work unscathed. Throughout the duration of the project, Wolfram Mathematica served as the development language and environment for the creation of the program. The software developed requires the input of a digital representation of artwork, a digital image, and outputs audible music that is closely related to the original image. To perform such a conversion, the image is analyzed to isolate valence and arousal values. With the use of Thayer's Emotional Plane, these values can be mapped to feelings that all humans experience, thereby capturing the intrinsic emotions present in the image. Then the same feelings are generated in music based on the principles of music theory.

Introduction

Sensory perception is, for most people, a modal experience: the responses of one sense to a stimulus do not directly affect the responses of another sense. Light cannot be heard and sound cannot be seen. Breaking this modality and perceiving a stimulus in multiple ways could be a precursor to revolutionary new technology to aid patients who have some type of sensory deficit. The focus of this project is to develop a program that will specifically break the barrier between vision and hearing by transforming a digital image into correlating music. Emotion provides a basis for such a conversion because both images and music have underlying emotions present in them. As such, this project attempts to use the emotions elicited by an image to create music that is linked to it.

Literature Review

Emotional Response to Color Stimuli

People react to colored stimuli differently depending on the hue that is shown to them. Each pigment has a set of feelings that tend to emerge when that color is recognized by the brain. While any given emotion can be elicited by multiple colors, there are distinctly observable responses that occur most often for each color. Red invokes protective and defensive reactions, and orange is exciting. Yellow is also exciting, but it is cheerful and jovial as well. Green is unusual in that it does not strongly correlate with any response. The colors blue and brown cause pleasant and secure feelings. Stately and dignified emotions arise when the stimulus is purple. White is calm and tender, but black is unhappy, disturbed, and dangerous. Grey causes emotions of boredom and melancholy. Each hue can stimulate the arousal of emotions, and nearly every emotion is associated with a color (Laurier et al., 2009).

Figure 1. Emotion based on arousal and valence. The graph shows the emotion at each combination of valence and arousal values (Laurier et al., 2009).

Figure 1 shows that scientists can express numerous emotions with the use of just two values, valence and arousal. Valence measures whether the emotion is a good emotion, such as feeling happy, or a bad emotion, such as feeling sad. Arousal measures how strongly the emotion is felt and can relate to how much energy the subject has (Schaie, 1961). Scientists have been able to associate every emotion depicted in Figure 1 with a corresponding valence and arousal (Laurier et al., 2009).
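Since the plane reduces every emotion to a point, identifying the emotion of an image reduces to a nearest-point lookup. The following is a minimal Wolfram Language sketch of that idea; the four labels and their (valence, arousal) coordinates are illustrative placeholders, not the published values of Laurier et al.

    (* Illustrative emotion points on a (valence, arousal) plane; the values are assumed. *)
    emotions = {{0.8, 0.6} -> "happy", {0.7, -0.5} -> "relaxed",
        {-0.7, -0.4} -> "sad", {-0.6, 0.7} -> "angry"};
    Nearest[emotions, {0.5, 0.4}]  (* -> {"happy"}: the label nearest the query point *)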

Music Theory and Musical Notation

Music is full of emotion, which is the reason music is appealing. Numerically identifying the mood of a musical piece, however, has proved difficult. Researchers hope to match sections of music with the emotions those sections evoke. By identifying common musical patterns that invoke emotion, investigators should be able to search any musical number for those patterns, thereby identifying the mood of the piece. Once the mood of a piece can be identified, it can be matched to people who enjoy that type of music. People prefer certain types of music because they cause certain emotions; by isolating these emotions, scientists can match a person with the music they favor (Jun, 2010).

Early studies based mood classifications on audio files alone. However, lyrics also play a large role in the emotion of a piece, since certain words inspire emotions. The effect of lyrics on the mood of a musical number had long been ignored, but recent studies have looked into the feelings that lyrics inspire. To accurately classify the emotions in a musical selection, both lyrics and musical sounds need to be taken into consideration; using both aspects of music in the analysis led to generally more accurate results (Hu, 2010).

Western common music notation, or CMN, is one of the most widely used musical notations. CMN uses the location of symbols on a five-line staff to indicate pitch. In this system there are 12 notes: C, C♯, D, D♯, E, F, F♯, G, G♯, A, A♯, and B. Each note is higher than the previous one. Sharps, notes marked with a ♯, lie halfway between the unsharpened note and the note one step above it. There are no notes between E and F or between B and C. Each of these notes can also be raised or lowered to a different octave ("Musical Sound", 2010). A note in an octave has a frequency twice as high as the same note in the octave directly beneath it. A4 (the standard for musical notes from which all others are derived) has a frequency of 440 Hz; the octave above it must therefore have a frequency of 880 Hz, and the octave below it a frequency of 220 Hz (Olson, 1967). Generally, any octave of a note can serve the same purpose in a piece of music. The frequencies of all notes besides A4 in CMN are based on the equal-temperament scale. In this scale, an octave is divided into 12 intervals, and the ratio between consecutive notes is always equal, hence the name equal temperament (Loy, 2006).
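Stated compactly, a note n semitones above A4 has frequency 440 · 2^(n/12) Hz, so twelve steps double the frequency. A minimal Wolfram Language check (the function name freq is ours):

    freq[n_] := 440*2^(n/12)  (* n = semitones above A4; negative n reaches below *)
    {freq[-12], freq[0], freq[12]}  (* -> {220, 440, 880}, the octaves cited above *)
    N[freq[1]/freq[0]]  (* -> 1.05946..., the constant ratio 2^(1/12) between consecutive notes *)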

Different physical construction causes different musical instruments to be better suited to specific ranges of pitch. While they are not always limited to this range, the following notes are the commonly used notes for some common instruments:

Percussion: Piano A0-C8, Organ C1-B7, Bells F5-C8, Chimes C5-F6, Xylophone C4-C8, Vibraphone F3-C7, Marimba A2-C7, Timpani F2-F3

Woodwind: Piccolo C5-A♯7, Flute C4-C7, Soprano Saxophone F3-D♯6, Alto Saxophone C♯3-G♯5, Tenor Saxophone G♯2-D♯5, Baritone Saxophone C♯2-D♯5, Bass Saxophone G♯1-D♯4, Soprano Clarinet D3-C♯6, Alto Clarinet G2-G♯5, Bass Clarinet D2-D♯5, Oboe A♯3-F6, English Horn F3-F5, Bassoon A♯1-D♯5

Brass and Strings: Cornet/Trumpet F3-A♯5, French Horn B1-F5, Trombone/Euphonium E2-A♯4, Bass Tuba E1-A♯3, Guitar E2-F5, Harp C1-G7, Violin A♯3-C7, Viola C3-C6, Cello C2-E5, Bass E1-A♯3

Different notes are classified into different voices. Music is usually composed of multiple parts, each with a different voice that covers a different section of the available range of sound. Soprano is the highest in average pitch, followed by alto, tenor, baritone, and bass respectively. These voices typically fall in the following ranges: Soprano C4-C6, Alto G3-G5, Tenor D3-A♯4, Baritone A2-G4, and Bass E2-D♯4 (Pierce, 1992).

Image Processing and Digital Image Data

Scientists have had difficulty defining color; investigators generally define it as an attribute of optical perception. Researchers have tried to come up with a better explanation, but describing the term has proved very difficult and the results have often been unsatisfying. Because color is a property of the visual experience, it can inspire emotions (Sharma, 2006).

Modern images comprise numerous square pixels. In a computer, all images consist of pixels, from images stored in memory to images captured by cameras to images displayed on the monitor. A pixel can be any of 16,777,216 available colors, which computers identify by a 6-digit hexadecimal code. The pixels act like the pieces of a mosaic: there are numerous of them in different colors, and the picture becomes apparent when the image is examined as a whole. Computers can represent any image by using enough pixels (Kuehni, 2005).
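As a concrete illustration, the 6-digit code packs three 8-bit channels, two hexadecimal digits each for red, green, and blue, which is where 256 × 256 × 256 = 16,777,216 comes from. A minimal Wolfram Language sketch decoding an arbitrary example code:

    IntegerDigits[16^^FF8800, 256, 3]  (* -> {255, 136, 0}: the red, green, and blue channel values *)
    RGBColor @@ (IntegerDigits[16^^FF8800, 256, 3]/255.)  (* the corresponding color, an orange *)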


Digital images can be stored in many different formats. The simplest form is a bitmap, which simply contains the color data for each individual pixel, organized in a way that indicates each pixel's location. The major problem with this method is that it requires the storage and handling of vast quantities of data, which makes moving, working with, and saving the images difficult for computers. To circumvent this problem, computer scientists developed compressed formats for image storage, which allow large images to be stored in a much smaller amount of data. The most prevalent method of image compression is the JPEG format. Compression relies on grouping areas of similar pixel color and storing rules the computer can follow to reproduce a nearly identical image. The largest problem with this method is that it can only store an image with roughly 95% accuracy, and the quality can be reduced further to compress the image more; compressing images can therefore significantly reduce their quality (Neelamani, 2006).
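One hedged way to see the trade-off is to export the same image in an uncompressed and a compressed format and compare the sizes. The sketch below uses a built-in Wolfram test image; the exact byte counts will vary with image content.

    img = ExampleData[{"TestImage", "Mandrill"}];  (* a standard built-in test image *)
    {StringLength[ExportString[img, "BMP"]], StringLength[ExportString[img, "JPEG"]]}
    (* the JPEG encoding is typically many times smaller than the raw bitmap *)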

Image processing is a technique computer scientists use to analyze images with computers. It can be both difficult and processor intensive. Different means can be used to search for patterns in images: the computer can compare average color, look for large changes in color, break colors into components and analyze each component separately, or analyze images in any other way it is programmed to. Image processing can be difficult to do, but if done correctly, it can be an extremely powerful tool (Sharma, 2010).
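Two of the pattern searches named above, average color and average pixel-to-pixel change, can each be computed in a few lines. A minimal sketch, reusing the built-in test image (the variable names are ours):

    img = ExampleData[{"TestImage", "Mandrill"}];
    data = ImageData[ColorConvert[img, "RGB"]];          (* rows of {r, g, b} triples in [0, 1] *)
    avgColor = Mean[Flatten[data, 1]]                    (* the mean red, green, and blue values *)
    avgChange = Mean[Abs[Flatten[Differences /@ data]]]  (* mean change between horizontally adjacent pixels *)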

Image Processing in Mathematica

Wolfram Research has created Mathematica, a software tool that helps mathematicians do complex mathematics. In addition to performing math, it offers a programming language similar to several other programming options. Mathematica offers unique ways to handle and analyze images. For instance, it is very easy to import images; they can simply be dragged into the notebook file. Once they are in the document, numerous functions are available, ranging from partitioning an image to grouping images by color scheme. ImagePartition, a function Wolfram Research provides, takes an image and a partition size; Mathematica then returns a list of lists of smaller images, squares with a side length equal to the specified partition size. This can be very useful, allowing the image to be broken down into usable sections and helping simplify the image analysis process. This function, along with other functions that help with color analysis, allows a program to process images with relative ease.
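A minimal sketch of that workflow, with an assumed 64-pixel partition size:

    img = ExampleData[{"TestImage", "Mandrill"}];
    tiles = ImagePartition[img, 64];  (* a matrix of 64x64-pixel sub-images *)
    Dimensions[tiles]                 (* how many rows and columns of tiles were produced *)
    tileColors = Map[Mean[Flatten[ImageData[#], 1]] &, tiles, {2}]  (* average color of every tile *)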

Music and Sound

According to Sarah Rutkiewicz, an expert in music theory and musical education, music is organized according to pitch, rhythm, or harmony. This organization leads to an artistic element present in the arrangement that the performer can reproduce. Noise, by contrast, is random, meaningless, and accidental, and lacks the essential artistic element. While not all noise is music, music can be considered a more specific type of noise (personal communication).


Research Plan

A. Researchable question or engineering problem being addressed:

The goal of this project is to design a program that uses an image as input to generate music that is emotionally linked to the image.

B. Hypothesis/Goals:

The goal of the research and development is to code a program that uses an image as input to create music that is emotionally linked to the image.

C. Description in detail of methods or procedures:

Mathematica will be used to analyze input images and create music. The methods used include finding the size, the ratio of length to width, and the average color value of the image; searching for areas with a large concentration of one color and for dramatic changes in hue; and looking for patterns in the number of times a certain color appears. Each of these properties of the image will correspond to a property of the music, such as the key signature, time signature, tempo, or instrumentation. The image will be partitioned into smaller sections that correspond to measures of the music, and the overall color value of each smaller image will determine the chord of that measure. Finally, the smaller sections will be broken again into still smaller sections, each representing one thirty-second note. The brightness of these portions will determine whether there is sound or silence during the time that the thirty-second note represents, as in the sketch below.
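The following is a minimal sketch of that partitioning scheme; the 128-pixel measure size, the 16-pixel note cells, and the 0.5 brightness threshold are all assumptions chosen for illustration.

    img = ExampleData[{"TestImage", "Mandrill"}];
    measures = Flatten[ImagePartition[img, 128]];  (* one sub-image per measure *)
    brightness[i_] := Mean[Flatten[ImageData[ColorConvert[i, "Grayscale"]]]]
    Map[brightness[#] > 0.5 &, Flatten[ImagePartition[First[measures], 16]]]
    (* -> True (sound) or False (silence) for each thirty-second-note cell of the first measure *)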

Methodology

Mathematica 7.0.1.0 for Students from Wolfram Research was used to write a program that analyzes a digital image and generates audible music. An image was copied to the clipboard and pasted into the Mathematica file next to the variable called image, and the program was subsequently run. After the music was generated, the command Export["SoundFile.mid", Sound[data, time]] was run in Mathematica, which created a MIDI file on the C drive of the computer. This file was opened in Sibelius 6.1.0, which displayed the music as sheet music. Sibelius was used to print the sheet music.

Results and Discussion

The application resulting from this investigation begins with a path or a URL to an image. After importing the image, the program determines the overall valence and arousal of the graphic using the average color and the average change in color from one pixel to the next. Using these values to determine the average emotion present in the piece, the program selects the key that correlates best with that emotion by calculating the closest predefined point on Thayer's Emotional Plane; the emotion that lies closest to the point representing the image being analyzed is used.

After discerning the key, chord progressions are formed in a similar manner. The image is subdivided into numerous square partitions, and each partition is analyzed using the same process. The resulting vector on the Thayer Emotional Plane and the previously played chord are used to determine which chord is next in the progression; the program always ensures that the chord matches appropriately with the surrounding music by comparing it to the previous chord. This process is repeated for every section of the image until a lengthy progression of chords has been constructed.
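The following is a hypothetical sketch of a single chord-selection step, not the author's exact rule: from the chords permitted to follow the previous one, choose the chord whose predefined (valence, arousal) point lies nearest the partition's point. The progression rules and mood points are toy values.

    allowed = {1 -> {4, 5, 6}, 4 -> {1, 5}, 5 -> {1, 6}, 6 -> {4, 5}};  (* toy rules: chord -> permitted successors *)
    mood = {1 -> {0.8, 0.5}, 4 -> {0.6, 0.4}, 5 -> {0.5, 0.8}, 6 -> {-0.4, 0.3}};  (* assumed chord mood points *)
    nextChord[prev_, va_] := First[SortBy[prev /. allowed, EuclideanDistance[va, # /. mood] &]]
    nextChord[1, {0.7, 0.9}]  (* -> 5: the successor of chord 1 nearest this partition's mood *)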

Upon the completion of the chords, the sub-sections are reanalyzed in order to form a melody. This process looks at the valence and arousal values of even smaller subsections of pixels, which allows multiple melodic notes to be played with each chord. Before committing each note to the final sound, the note is checked against the rest of the notes being played at that time to ensure that the resulting sound accords with music theory. After both the chords and the melody are established, the program compiles them into a single music file that can be exported as a MIDI file and opened as sheet music in other programs such as Sibelius.
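A minimal sketch of that final assembly step, using the Wolfram Language's SoundNote, whose pitches below are semitones relative to middle C; the four chords are an arbitrary toy progression, not output from the actual program.

    chords = {{0, 4, 7}, {5, 9, 12}, {7, 11, 14}, {0, 4, 7}};  (* a toy I-IV-V-I progression in C *)
    music = Sound[SoundNote[#, 1, "Piano"] & /@ chords];       (* one one-second chord after another *)
    Export["SoundFile.mid", music]                             (* the MIDI file named in the Methodology *)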

Conclusions

The generated music gives indications of the contents of the original image. Images with lighter colors, such as yellow, light blue, light green, and other colors associated with joyful images, will produce cheerful music. Similarly, darker-colored images will generate somber music. The same correlation exists for arousal values: images with great variance in arousal will create music that makes drastic changes, while images with little variance will be represented by very similar notes that seem to roll throughout the entirety of the piece. While the correlation between the source and the resulting music is not strong enough to convey all of the information present in the image, it gives insight into the main themes present in the art.

Limitations and Assumptions

For the prototype software to function properly, it must be assumed that the valence of the emotions present in the image is based solely on the average color and that the arousal values are based solely on the magnitude of the variation in colors. In some images, the mood of the piece is not accurately portrayed by these characteristics. The program will also only work for Americans, because the version of Thayer's Emotional Plane the program uses is based on people living in the United States; other planes would be necessary for other locations around the world. The program was designed to take any digital image as input and base the output music solely on that data. For this reason, the input is uncontrollable, and the only controllable aspect is the process the image undergoes.

Applications and Future Experiments

The work on image analysis is a breakthrough in inter-sensory interpretation, a technique that permits what one sense perceives to be experienced similarly through another sense. The program resulting from this project provides the ability not only to visualize an image, but also to experience it aurally. The next step in the project is to add seventh chords, a style of chord currently unsupported by the program. Seventh chords will allow the use of more complex chord progressions, which will eventually result in a more accurate aural portrayal of an image. To maximize the program's features, additional instruments and volume levels would have to be added.


Literature Cited

Hu, X. (2010). Improving mood classification in music digital libraries by combining lyrics and audio. Association for Computing Machinery. Retrieved from http://xml.engineeringvillage2.org

Jun, S. (2010). Music retrieval and recommendation scheme based on varying mood sequences. International Journal on Semantic Web and Information Systems, 6(2). Retrieved from http://find.galegroup.com

Kuehni, R. G. (2005). Color: An introduction to practice and principles (2nd ed.). Hoboken, NJ: John Wiley & Sons, Inc.

Laurier, C., Meyers, O., Serrà, J., Blech, M., Herrera, P., & Serra, X. (2009). Indexing music by mood: Design and integration of an automatic content-based annotator. Multimedia Tools and Applications, 47(3). Retrieved from http://find.galegroup.com

Loy, G. D. (2006). Musimathics: A guided tour of the mathematics of music (Vol. 1). Cambridge, MA: The MIT Press.

Musical sound. (2010). In Encyclopædia Britannica. Retrieved October 20, 2010, from Encyclopædia Britannica Online: http://www.britannica.com/EBchecked/topic/399266/musical-sound/64497/Pitch-and-timbre?anchor=ref529625

Neelamani, R. (2006). JPEG compression history estimation for color images. IEEE Transactions on Image Processing, 15(6). Retrieved from http://ieeexplore.ieee.org

Olson, H. F. (1967). Music, physics, and engineering (2nd ed.). New York: Dover Publications, Inc.

Schaie, K. W. (1961). Scaling the association between colors and mood-tones. The American Journal of Psychology, 74, 226-273.

Sharma, G. (Ed.). (2002). Digital color imaging handbook. CRC Press.
