Generating Music Through Image Analysis - the Scientia Review
Music Generation Through Images
Generating Music Through Image Analysis

Gregory Granito
Massachusetts Academy of Math and Science

Abstract
In order to create a bridge between visual perception and aural perception, emotions were used as a link to convert a purely visual form of art into a purely audio form. The conversion is intended to leave the emotional essence of the work unscathed. Throughout the duration of the project, Wolfram Mathematica served as the development language and environment for the creation of the program. The software developed requires the input of a digital representation of artwork, a digital image, and outputs audible music that is largely related to the original image. To perform such a conversion, the image is analyzed to isolate valence and arousal values. With the use of Thayer's Emotional Plane, these values can be mapped to feelings that all humans experience, thereby capturing the intrinsic emotions present in the image. Then the same feelings are generated in music based on the principles of music theory.
Introduction

Sensory perception, for most, is a modal experience, which means that the responses of one sense to a stimulus do not directly affect the responses of another sense. Light cannot be heard and sound cannot be seen. The ability to break this modality and perceive a stimulus in multiple ways could be a precursor to revolutionary new technology to aid patients who have some type of sensory deficit. The focus of this project is to develop a program that will specifically break the barrier between vision and hearing by transforming a digital image into correlating music. Emotion provides a basis for such a conversion because both images and music have underlying emotions present in them. As such, this project attempts to use the emotions elicited by an image to create music that is linked to it.
Literature Review

Emotional Response to Color Stimuli

People react to colored stimuli differently depending on the hue that is shown to them. Each pigment has a set of feelings that tend to emerge when that color is recognized by the brain. While any given emotion can be evoked by multiple colors, there are distinctly observable responses that occur most often for each color. Red invokes protective and defensive reactions, and orange is exciting. Yellow is also exciting, but it is also cheerful and jovial. Green is unusual in that it does not strongly correlate with any response. The colors blue and brown cause pleasant and secure feelings. Stately and dignified emotions arise when the stimulus is purple. White is calm and tender, but black is associated with unhappiness, disturbance, and danger. Grey causes emotions of boredom and melancholy. Each hue can stimulate the arousal of emotions, and nearly every emotion is associated with a color (Laurier et al., 2009).
Figure 1. Emotion based on arousal and valence. The graph shows the emotion based on valence and arousal ratios (Laurier et al., 2009).
Figure 1 shows that scientists can express numerous emotions with the use of only two values, valence and arousal. Valence is whether the emotion is a good emotion, such as feeling happy, or a bad emotion, such as feeling sad. Arousal is how strongly the emotion is felt and can relate to how much energy the subject has (Schaie, 1961). Scientists have been able to associate every emotion depicted in Figure 1 with a corresponding valence and arousal (Laurier et al., 2009).
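In code, this association can be sketched as a nearest-neighbor lookup on the valence-arousal plane: given a measured (valence, arousal) pair, the closest labeled point names the emotion. The sketch below is in Python for illustration, and the coordinates are hypothetical placeholders, not values taken from Laurier et al. (2009).

```python
import math

# Illustrative sample of labeled points on the valence/arousal plane.
# These coordinates are assumptions for the sketch, not published values.
EMOTION_POINTS = {
    "happy":   ( 0.8,  0.5),
    "excited": ( 0.5,  0.9),
    "angry":   (-0.6,  0.8),
    "sad":     (-0.7, -0.5),
    "calm":    ( 0.6, -0.6),
    "bored":   (-0.3, -0.8),
}

def nearest_emotion(valence, arousal):
    """Return the labeled emotion whose point lies closest to (valence, arousal)."""
    return min(EMOTION_POINTS,
               key=lambda name: math.dist((valence, arousal), EMOTION_POINTS[name]))
```

With this table, a bright, moderately energetic image (high valence, mid arousal) would map to "happy", while a dark, low-energy one would map to "sad".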
Music Theory and Musical Notation

Music is full of emotion, which is the reason music is appealing. Numerically identifying the mood of a music piece has been a difficult science. Researchers hope to match sections of music with the emotions associated with those patterns. By identifying common musical patterns that invoke emotion, investigators should be able to search any musical number for those patterns, thereby identifying the mood of the piece. Once the mood of a piece can be identified, it can be matched to people who enjoy that type of music. People prefer certain types of music because they cause certain emotions. By isolating these emotions, scientists can match a person with the music they favor (Jun, 2010).

Studies have based mood classifications on audio files alone. However, lyrics also play a large role in the emotion of a piece. Certain words inspire emotions. The effect of lyrics on the mood of a musical number had been ignored, but recently studies have looked into the feelings that lyrics inspire. To accurately classify the emotions in a musical selection, both lyrics and musical sounds need to be taken into consideration. Using both aspects of music in the analysis led to generally more accurate results (Hu, 2010).
The Western common music notation system, or CMN, is one of the most common musical notations. CMN uses the location of symbols on a five-line staff to determine pitch. In this system there are 12 notes: C, C♯, D, D♯, E, F, F♯, G, G♯, A, A♯, and B. Each note is higher than the previous. Sharps, notes marked with ♯, are halfway between the unsharpened note and the note one higher than it. There are no notes between E and F, or between B and C. Each of these notes can also be raised or lowered to a different octave ("Musical Sound", 2010). A note in an octave has a frequency twice as high as the same note in the octave directly beneath it. A4 (the standard for musical notes from which all others are derived) has a frequency of 440 Hz. The octave above that must have a frequency of 880 Hz and the octave below it must have a frequency of 220 Hz (Olson, 1967). Generally, any octave of a note can serve the same purpose in a piece of music. The frequencies of all of the other notes besides A4 in CMN are based on the equal-temperament scale. In this scale, an octave is divided into 12 intervals, and the ratio between each pair of consecutive notes is always equal, hence the name equal temperament (Loy, 2006).
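The equal-temperament rule above pins down every frequency from A4 alone: moving one semitone multiplies the frequency by 2^(1/12), so moving twelve semitones doubles it. A minimal sketch of this formula (in Python, for illustration):

```python
def note_frequency(semitones_from_a4):
    """Frequency in Hz of the note n semitones above (or below) A4 = 440 Hz,
    under 12-tone equal temperament: each semitone multiplies by 2**(1/12)."""
    return 440.0 * 2 ** (semitones_from_a4 / 12)
```

For example, twelve semitones up gives 880 Hz and twelve down gives 220 Hz, matching the octave relationship described above, while C4 (nine semitones below A4) comes out near 261.63 Hz.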
Different physical construction causes different musical instruments to be better suited to specific ranges of pitch. While they are not always limited to this range, the following notes are the commonly used notes for some common instruments:

Percussion: Piano A0-C8, Organ C1-B7, Bells F5-C8, Chimes C5-F6, Xylophone C4-C8, Vibraphone F3-C7, Marimba A2-C7, Timpani F2-F3

Woodwind: Piccolo C5-A♯7, Flute C4-C7, Soprano Saxophone F3-D♯6, Alto Saxophone C♯3-G♯5, Tenor Saxophone G♯2-D♯5, Baritone Saxophone C♯2-D♯5, Bass Saxophone G♯1-D♯4, Soprano Clarinet D3-C♯6, Alto Clarinet G2-G♯5, Bass Clarinet D2-D♯5, Oboe A♯3-F6, English Horn F3-F5, Bassoon A♯1-D♯5

Brass and Strings: Cornet/Trumpet F3-A♯5, French Horn B1-F5, Trombone/Euphonium E2-A♯4, Bass Tuba E1-A♯3, Guitar E2-F5, Harp C1-G7, Violin A♯3-C7, Viola C3-C6, Cello C2-E5, Bass E1-A♯3
Different notes are classified into different voices. Music is usually composed of multiple parts, each with a different voice that covers a different section of the available range of sound. Soprano is the highest in average pitch, followed by alto, tenor, baritone, and bass respectively. These voices typically fall in the following ranges: Soprano C4-C6, Alto G3-G5, Tenor D3-A♯4, Baritone A2-G4, and Bass E2-D♯4 (Pierce, 1992).
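A range table like the one above is straightforward to check programmatically once note names are mapped to numbers. The sketch below uses the common MIDI numbering convention (C4 = 60, so A4 = 69); that convention is an assumption of this illustration, not something described in the original program.

```python
# Semitone offset of each note name within an octave.
NOTE_INDEX = {"C": 0, "C#": 1, "D": 2, "D#": 3, "E": 4, "F": 5,
              "F#": 6, "G": 7, "G#": 8, "A": 9, "A#": 10, "B": 11}

def midi_number(name, octave):
    """MIDI number of a note, using the C4 = 60 convention."""
    return 12 * (octave + 1) + NOTE_INDEX[name]

# Voice ranges from the text, encoded as (low, high) MIDI numbers.
VOICE_RANGES = {
    "soprano":  (midi_number("C", 4), midi_number("C", 6)),
    "alto":     (midi_number("G", 3), midi_number("G", 5)),
    "tenor":    (midi_number("D", 3), midi_number("A#", 4)),
    "baritone": (midi_number("A", 2), midi_number("G", 4)),
    "bass":     (midi_number("E", 2), midi_number("D#", 4)),
}

def in_voice(name, octave, voice):
    """True if the note falls inside the typical range of the given voice."""
    lo, hi = VOICE_RANGES[voice]
    return lo <= midi_number(name, octave) <= hi
```

For instance, C5 sits comfortably inside the soprano range but above the bass range.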
Image Processing and Digital Image Data

Scientists have had difficulty defining color; however, investigators define it as an attribute of optical perception. Researchers have tried to come up with a better explanation, but describing the term has proved very difficult and the results have often been unsatisfying. Because color is a property of the visual experience, it can inspire emotions (Sharma, 2006).

Modern images comprise numerous square pixels. In a computer, all images must contain pixels, from stored images in memory to images captured by cameras to images displayed by the monitor. A pixel can be any color from the list of 16,777,216 available colors, which computers identify by a 6-digit hexadecimal code. The pixels act like pieces of a mosaic; there are numerous of them in different colors, and the full picture becomes apparent when the image as a whole is examined. Computers can represent any image by using enough pixels (Kuehni, 2005).
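The 6-digit hexadecimal code mentioned above simply packs the three 8-bit color channels (red, green, blue) into two hex digits each, which is also where the figure of 16,777,216 = 2^24 colors comes from. A short illustrative sketch:

```python
def pixel_to_hex(r, g, b):
    """Encode an 8-bit-per-channel RGB pixel as the 6-digit hexadecimal
    code computers commonly use to identify one of 2**24 colors."""
    return f"{r:02X}{g:02X}{b:02X}"
```

White (255, 255, 255) encodes as "FFFFFF", and three channels of 8 bits each give exactly 2^24 distinct codes.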
Digital images can be stored in many different formats. The simplest form is a bitmap. This simply contains the data for the color of each individual pixel and is organized in a way that indicates the location of the pixel. The major problem with this method is that it requires the storage and handling of vast quantities of data. This problem causes computers to have difficulty moving, working with, and saving the images. To circumvent this problem, computer scientists developed compressed formats for image storage. The development of new formats allowed large images to be stored in a much smaller amount of data. The most prevalent method of image compression is the JPEG format. Compression relies on grouping areas of similar pixel color and storing rules the computer can follow to reproduce a nearly identical image. The largest problem with this method is that it is only able to store an image with 95% accuracy. In addition, the quality can be reduced to further compress the image; heavier compression significantly reduces quality (Neelamani, 2006).
Image processing is a technique used by computer scientists to analyze images through the use of computers. This can be both difficult and processor intensive. Different means can be used to search for patterns in images. The computer can compare average color, it can look for large changes in color, it can break colors into components and analyze each component differently, and it can analyze images in any other way it is programmed. Image processing can be difficult to do, but if done correctly, it can be an extremely powerful tool (Sharma, 2010).
Image Processing in Mathematica

Wolfram Research has created Mathematica, a software tool that helps mathematicians do complex mathematics. In addition to performing math, it has been adapted so that it offers a programming language similar to several other programming options. Mathematica offers unique ways to handle and analyze images. For instance, it is very easy to import images; they can simply be dragged into the notebook file. Once they are in the document, there are numerous functions available, ranging from partitioning an image to grouping images by color schemes. ImagePartition, a function Wolfram Research has provided, takes an image and a partition size. Mathematica then returns a list of lists of smaller images. These smaller images are squares with a side length equal to the partition size specified. This can be very useful, allowing the image to be broken down into usable sections, helping simplify the image analysis process. This function, along with other functions that help with color analysis, allows a program to process images with relative ease.
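A rough Python equivalent of this partitioning behavior may clarify what the function returns: a grid (list of lists) of square tiles. This sketch assumes, as ImagePartition does by default, that edge pixels that do not fill a complete tile are dropped.

```python
def partition_image(pixels, size):
    """Split a 2-D grid of pixel values into size x size square tiles,
    mimicking the shape of Mathematica's ImagePartition output.
    Incomplete edge tiles are discarded."""
    rows = len(pixels) // size
    cols = len(pixels[0]) // size
    return [[[row[c * size:(c + 1) * size]
              for row in pixels[r * size:(r + 1) * size]]
             for c in range(cols)]
            for r in range(rows)]
```

Partitioning a 4x4 grid with size 2 yields a 2x2 grid of 2x2 tiles, each of which can then be analyzed as its own small image.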
Music and Sound

According to Sarah Rutkiewicz, an expert in music theory and musical education, music is organized according to pitch, rhythm, or harmony. This organization leads to an artistic element present in the arrangement that the performer can reproduce. Noise, by contrast, is random, meaningless, accidental, and lacks the essential artistic element. While not all noise is music, music can be considered a more specific type of noise (personal communication).
Research Plan

A. Researchable question or engineering problem being addressed

The goal of this project is to design a program that uses an image as input to generate music that is emotionally linked to the image.

B. Hypothesis/Goals

The goal of the research and development is to code a program that uses an image as input to create music that is emotionally linked to the image.
C. Description in detail of methods or procedures

Mathematica will be used to analyze inputted images and create music. The methods that will be used include finding the size, ratio of length to width, and average color value of the image; searching for areas with a large concentration of a color and searching for dramatic changes in hue; and looking for patterns regarding the number of times a certain color appears. Each one of these properties of the image will correspond to a property of music, such as the key signature, time signature, tempo, and instrumentation. The image will be partitioned into smaller sections that will correspond to measures of the music. The overall color value of these smaller images will be used to determine the chord of the measure. Finally, the smaller sections will be broken again into smaller sections that will each represent one thirty-second note. The brightness of these portions will be used to determine whether there is sound or silence during the time that the thirty-second note represents.
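The final brightness test above can be sketched as a simple threshold over a measure's sub-sections. The 0.5 cutoff and the per-measure list shape are illustrative assumptions, not the project's actual values.

```python
def rhythm_from_brightness(brightnesses, threshold=0.5):
    """Map each thirty-second-note slot's average brightness (in [0, 1])
    to sound or silence. The 0.5 threshold is an illustrative choice."""
    return ["note" if b >= threshold else "rest" for b in brightnesses]
```

A bright-dark-bright sequence of sub-sections would therefore produce a note, a rest, and a note in that slot order.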
Methodology

Mathematica 7.0.1.0 for Students from Wolfram Research was used to write a program that analyzes a digital image and generates audible music. An image was copied to the clipboard and pasted into the Mathematica file next to the variable called image, and the program was subsequently run. After the music was generated, the command Export["SoundFile.mid", Sound[data, time]] was run in Mathematica, which created a MIDI file on the C drive of the computer. This file was opened in Sibelius 6.1.0, which displayed the music as sheet music. Sibelius was used to print the sheet music.
Results and Discussion

The application resulting from this investigation begins with a path or a URL to an image. After importing the image, the overall valence and arousal of the graphic are determined using the average color and the average change in color from one pixel to the next. Using these values to determine the average emotion present in the piece, the program selects an appropriate key that correlates best with that emotion by calculating the closest predefined point on Thayer's Emotional Plane. The emotion that lies closest to the point representing the image being analyzed is used.
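The two measurements described above can be approximated in a few lines. This is an illustrative simplification of the actual Mathematica program: it operates on a grayscale image given as a 2-D list of values in [0, 1], proxies valence by average brightness, and proxies arousal by the average change between horizontally adjacent pixels.

```python
def image_valence_arousal(gray):
    """Estimate (valence, arousal) for a grayscale image.
    Valence: average brightness. Arousal: average absolute change
    from one pixel to its horizontal neighbor. Both formulas are
    illustrative stand-ins for the program's actual analysis."""
    flat = [p for row in gray for p in row]
    valence = sum(flat) / len(flat)
    diffs = [abs(row[i + 1] - row[i]) for row in gray for i in range(len(row) - 1)]
    arousal = sum(diffs) / len(diffs) if diffs else 0.0
    return valence, arousal
```

A checkerboard-like image of alternating black and white pixels scores mid valence but maximal arousal, while a uniform gray image scores zero arousal.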
After discerning the key, chord progressions are formed in a similar manner. The image is subdivided into numerous square partitions. Each partition is analyzed using the same process. The resulting vector on the Thayer Emotional Plane and the previously played chord are used to determine which chord is next in the progression; however, the program always ensures that the chord matches appropriately with the surrounding music by comparing it to the previous chord. This process is repeated for every section of the image until a lengthy progression of chords has been constructed.
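One way to sketch the "previous chord constrains the next chord" rule is a small successor table: the section's emotionally preferred chord is used only if the progression rules allow it after the previous chord, with a fallback otherwise. The table below is hypothetical, not the program's actual rules.

```python
# Hypothetical allowed-successor table over Roman-numeral chord degrees.
# An illustrative stand-in for the program's actual progression rules.
ALLOWED_NEXT = {
    "I":  ["IV", "V", "vi"],
    "IV": ["V", "I"],
    "V":  ["I", "vi"],
    "vi": ["IV", "V"],
}

def next_chord(previous, preferred):
    """Pick the chord for the next measure: take the section's preferred
    chord if the rules allow it after `previous`, else fall back to the
    first allowed successor."""
    allowed = ALLOWED_NEXT[previous]
    return preferred if preferred in allowed else allowed[0]
```

So a section whose emotion suggests V after a I chord gets V, but a suggestion of vi after IV is overridden because the table forbids it.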
Upon the completion of the chords, the sub-sections are reanalyzed in order to form a melody. This process looks at the valence and arousal values of even smaller subsections of pixels. This allows multiple melodic notes to be played with each chord. Before committing each note to the final sound, the note is checked against the rest of the notes being played at that time to ensure that the sound produced will be correctly in accordance with music theory. After both the chords and the melody are established, the program then compiles them into a single music file that can be exported as a MIDI file and opened as sheet music in other programs such as Sibelius.
Conclusions

The generated music gives indications of the contents of the original image. Images with lighter colors, such as yellow, light blue, light green, and other colors associated with joyful images will produce cheerful music. Similarly, darker colored images will generate somber music. The same correlation exists for arousal values; images with great variances in arousal will create music that makes drastic changes. Images with little variance will be represented with very similar notes that seem to roll throughout the entirety of the piece. While the correlation between the source and the resulting music is not strong enough to convey all information present in the image, it gives insight into the main themes present in the art.
Limitations and Assumptions

For the prototype software to function properly, it must be assumed that the valence of the emotions present in the image is based solely on the average color and that the arousal values are based solely on the magnitude of the variation in colors. The program will also only work well for Americans, because the version of Thayer's Emotional Plane the program uses is based on people living in the United States; other planes would be necessary for other locations around the world. In some images, the mood of the piece is not accurately portrayed by these characteristics. The program was designed to take any digital image as input and base the outputted music solely on that data. For this reason, the input is uncontrollable, and the only controllable aspect is the process the image undergoes.
Applications and Future Experiments

The work on image analysis is a breakthrough in inter-sensory interpretation, a technique that permits the perception of one sense to be perceived similarly through another sense. The program resulting from this project will provide the ability not only to visualize an image, but also to experience it aurally. The next step in the project is to add seventh chords, a style of chord currently unsupported by the program. This will allow the use of more complex chord progressions, which will eventually result in a more accurate aural portrayal of the image. To maximize the features of the program, additional instruments and volume levels would have to be added.
Literature Cited

Hu, X. (2010). Improving mood classification in music digital libraries by combining lyrics and audio. Association for Computing Machinery. Retrieved from http://xml.engineeringvillage2.org
Jun, S. (2010). Music retrieval and recommendation scheme based on varying mood sequences. International Journal on Semantic Web and Information Systems, 6(2). Retrieved from http://find.galegroup.com
Kuehni, R. G. (2005). Color: An introduction to practice and principles (2nd ed.). Hoboken, New Jersey: John Wiley & Sons, Inc.
Laurier, C., Meyers, O., Serrá, J., Blech, M., Herrera, P., & Serra, X. (2009). Indexing music by mood: Design and integration of an automatic content-based annotator. Multimedia Tools and Applications, 47(3). Retrieved from http://find.galegroup.com
Loy, G. D. (2006). Musimathics: A guided tour of the mathematics of music (Vol. 1). Cambridge, MA: The MIT Press.
Musical sound. (2010). In Encyclopædia Britannica. Retrieved October 20, 2010, from Encyclopædia Britannica Online: http://www.britannica.com/EBchecked/topic/399266/musical-sound/64497/Pitch-and-timbre?anchor=ref529625
Neelamani, R. (2006). JPEG compression history estimation for color images. IEEE Transactions on Image Processing, 15(6). Retrieved from http://ieeexplore.ieee.org
Olson, H. F. (1967). Music, physics, and engineering (2nd ed.). New York: Dover Publications, Inc.
Schaie, K. W. (1961). Scaling the association between colors and mood-tones. The American Journal of Psychology, 74, 226-273.
Sharma, G. (Ed.). (2002). Digital color imaging handbook. CRC Press.