Generating Music Through Image Analysis - the Scientia Review
Music Generation Through Images
Generating Music Through Image Analysis

Gregory Granito
Massachusetts Academy of Math and Science

Abstract
In order to create a bridge between visual perception and aural perception, emotions were used as a link to convert a purely visual form of art into a purely audio form. The conversion is intended to leave the emotional essence of the work unscathed. Throughout the duration of the project, Wolfram Mathematica served as the development language and environment for the creation of the program. The software developed requires the input of a digital representation of artwork, a digital image, and outputs audible music that is largely related to the original image. To perform such a conversion, the image is analyzed to isolate valence and arousal values. With the use of Thayer's Emotional Plane, these values can be mapped to feelings that all humans experience, thereby capturing the intrinsic emotions present in the image. Then the same feelings are generated in music based on the principles of music theory.
Introduction

Sensory perception, for most, is a modal experience, which means that the responses of one sense to a stimulus do not directly affect the responses of another sense. Light cannot be heard and sound cannot be seen. The ability to break this modality and perceive a stimulus in multiple ways could be a precursor to revolutionary new technology to aid patients who have some type of sensory deficit. The focus of this project is to develop a program that will specifically break the barrier between vision and hearing by transforming a digital image into correlating music. Emotion provides a basis for such a conversion because both images and music have underlying emotions present in them. As such, this project attempts to use the emotions elicited by an image to create music that is linked to it.
Literature Review

Emotional Response to Color Stimuli

People react to colored stimuli differently depending on the hue that is shown to them. Each pigment has a set of feelings that tend to emerge when that color is recognized by the brain. While any given emotion can be evoked by multiple colors, there are distinctly observable responses that occur most often for each color. Red invokes protective and defensive reactions, and orange is exciting. Yellow is also exciting, but it is also cheerful and jovial. Green is unusual in that it does not strongly correlate with any response. The colors blue and brown cause pleasant and secure feelings. Stately and dignified emotions arise when the stimulus is purple. White is calm and tender, but black is associated with unhappiness, disturbance, and danger. Grey causes emotions of boredom and melancholy. Each hue can stimulate the arousal of emotions, and nearly every emotion is associated with a color (Laurier et al., 2009).
Figure 1. Emotion based on arousal and valence. The graph shows the emotion based on valence and arousal ratios (Laurier et al., 2009).
Figure 1 shows that scientists can express numerous emotions with the use of only two values, valence and arousal. Valence is whether the emotion is a good emotion, such as feeling happy, or a bad emotion, such as feeling sad. Arousal is how strongly the emotion is felt and can relate to how much energy the subject has (Schaie, 1961). Scientists have been able to associate every emotion depicted in Figure 1 with a corresponding valence and arousal (Laurier et al., 2009).
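In code, this association can be sketched as a nearest-neighbor lookup on the valence-arousal plane: given a measured (valence, arousal) pair, the closest labeled point names the emotion. The sketch below is in Python for illustration, and the coordinates are hypothetical placeholders, not values taken from Laurier et al. (2009).

```python
import math

# Illustrative sample of labeled points on the valence/arousal plane.
# These coordinates are assumptions for the sketch, not published values.
EMOTION_POINTS = {
    "happy":   ( 0.8,  0.5),
    "excited": ( 0.5,  0.9),
    "angry":   (-0.6,  0.8),
    "sad":     (-0.7, -0.5),
    "calm":    ( 0.6, -0.6),
    "bored":   (-0.3, -0.8),
}

def nearest_emotion(valence, arousal):
    """Return the labeled emotion whose point lies closest to (valence, arousal)."""
    return min(EMOTION_POINTS,
               key=lambda name: math.dist((valence, arousal), EMOTION_POINTS[name]))
```

With this table, a bright, moderately energetic image (high valence, mid arousal) would map to "happy", while a dark, low-energy one would map to "sad".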
Music Theory and Musical Notation

Music is full of emotion, which is the reason music is appealing. Numerically identifying the mood of a music piece has been a difficult science. Researchers hope to match sections of music with the emotions associated with those patterns. By identifying common musical patterns that invoke emotion, investigators should be able to search any musical number for those patterns, thereby identifying the mood of the piece. Once the mood of a piece can be identified, it can be matched to people who enjoy that type of music. People prefer certain types of music because they cause certain emotions. By isolating these emotions, scientists can match a person with the music they favor (Jun, 2010).

Studies have based mood classifications on audio files alone. However, lyrics also play a large role in the emotion of a piece. Certain words inspire emotions. The effect of lyrics on the mood of a musical number had been ignored, but recently studies have looked into the feelings that lyrics inspire. To accurately classify the emotions in a musical selection, both lyrics and musical sounds need to be taken into consideration. Using both aspects of music in the analysis led to generally more accurate results (Hu, 2010).
The Western common music notation system, or CMN, is one of the most common musical notations. CMN uses the location of symbols on a five-line staff to determine pitch. In this system there are 12 notes: C, C♯, D, D♯, E, F, F♯, G, G♯, A, A♯, and B. Each note is higher than the previous. Sharps, notes marked with ♯, are halfway between the unsharpened note and the note one higher than it. There are no notes between E and F, or between B and C. Each of these notes can also be raised or lowered to a different octave ("Musical Sound", 2010). A note in an octave has a frequency twice as high as the same note in the octave directly beneath it. A4 (the standard for musical notes from which all others are derived) has a frequency of 440 Hz. The octave above that must have a frequency of 880 Hz and the octave below it must have a frequency of 220 Hz (Olson, 1967). Generally, any octave of a note can serve the same purpose in a piece of music. The frequencies of all of the other notes besides A4 in CMN are based on the equal-temperament scale. In this scale, an octave is divided into 12 intervals, and the ratio between each pair of consecutive notes is always equal, hence the name equal temperament (Loy, 2006).
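The equal-temperament rule above pins down every frequency from A4 alone: moving one semitone multiplies the frequency by 2^(1/12), so moving twelve semitones doubles it. A minimal sketch of this formula (in Python, for illustration):

```python
def note_frequency(semitones_from_a4):
    """Frequency in Hz of the note n semitones above (or below) A4 = 440 Hz,
    under 12-tone equal temperament: each semitone multiplies by 2**(1/12)."""
    return 440.0 * 2 ** (semitones_from_a4 / 12)
```

For example, twelve semitones up gives 880 Hz and twelve down gives 220 Hz, matching the octave relationship described above, while C4 (nine semitones below A4) comes out near 261.63 Hz.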
Different physical construction causes different musical instruments to be better suited to specific ranges of pitch. While they are not always limited to this range, the following notes are the commonly used notes for some common instruments:

Percussion: Piano A0-C8, Organ C1-B7, Bells F5-C8, Chimes C5-F6, Xylophone C4-C8, Vibraphone F3-C7, Marimba A2-C7, Timpani F2-F3

Woodwind: Piccolo C5-A♯7, Flute C4-C7, Soprano Saxophone F3-D♯6, Alto Saxophone C♯3-G♯5, Tenor Saxophone G♯2-D♯5, Baritone Saxophone C♯2-D♯5, Bass Saxophone G♯1-D♯4, Soprano Clarinet D3-C♯6, Alto Clarinet G2-G♯5, Bass Clarinet D2-D♯5, Oboe A♯3-F6, English Horn F3-F5, Bassoon A♯1-D♯5

Brass and Strings: Cornet/Trumpet F3-A♯5, French Horn B1-F5, Trombone/Euphonium E2-A♯4, Bass Tuba E1-A♯3, Guitar E2-F5, Harp C1-G7, Violin A♯3-C7, Viola C3-C6, Cello C2-E5, Bass E1-A♯3
Different notes are classified into different voices. Music is usually composed of multiple parts, each with a different voice that covers a different section of the available range of sound. Soprano is the highest in average pitch, followed by alto, tenor, baritone, and bass respectively. These voices typically fall in the following ranges: Soprano C4-C6, Alto G3-G5, Tenor D3-A♯4, Baritone A2-G4, and Bass E2-D♯4 (Pierce, 1992).
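A range table like the one above is straightforward to check programmatically once note names are mapped to numbers. The sketch below uses the common MIDI numbering convention (C4 = 60, so A4 = 69); that convention is an assumption of this illustration, not something described in the original program.

```python
# Semitone offset of each note name within an octave.
NOTE_INDEX = {"C": 0, "C#": 1, "D": 2, "D#": 3, "E": 4, "F": 5,
              "F#": 6, "G": 7, "G#": 8, "A": 9, "A#": 10, "B": 11}

def midi_number(name, octave):
    """MIDI number of a note, using the C4 = 60 convention."""
    return 12 * (octave + 1) + NOTE_INDEX[name]

# Voice ranges from the text, encoded as (low, high) MIDI numbers.
VOICE_RANGES = {
    "soprano":  (midi_number("C", 4), midi_number("C", 6)),
    "alto":     (midi_number("G", 3), midi_number("G", 5)),
    "tenor":    (midi_number("D", 3), midi_number("A#", 4)),
    "baritone": (midi_number("A", 2), midi_number("G", 4)),
    "bass":     (midi_number("E", 2), midi_number("D#", 4)),
}

def in_voice(name, octave, voice):
    """True if the note falls inside the typical range of the given voice."""
    lo, hi = VOICE_RANGES[voice]
    return lo <= midi_number(name, octave) <= hi
```

For instance, C5 sits comfortably inside the soprano range but above the bass range.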
Image Processing and Digital Image Data

Scientists have had difficulty defining color; however, investigators define it as an attribute of optical perception. Researchers have tried to come up with a better explanation, but describing the term has proved very difficult and the results have often been unsatisfying. Because color is a property of the visual experience, it can inspire emotions (Sharma, 2006).

Modern images comprise numerous square pixels. In a computer, all images must contain pixels, from stored images in memory to images captured by cameras to images displayed by the monitor. A pixel can be any color from the list of 16,777,216 available colors, which computers identify by a 6-digit hexadecimal code. The pixels act like pieces of a mosaic; there are numerous of them in different colors, and the full picture becomes apparent when the image as a whole is examined. Computers can represent any image by using enough pixels (Kuehni, 2005).
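The 6-digit hexadecimal code mentioned above simply packs the three 8-bit color channels (red, green, blue) into two hex digits each, which is also where the figure of 16,777,216 = 2^24 colors comes from. A short illustrative sketch:

```python
def pixel_to_hex(r, g, b):
    """Encode an 8-bit-per-channel RGB pixel as the 6-digit hexadecimal
    code computers commonly use to identify one of 2**24 colors."""
    return f"{r:02X}{g:02X}{b:02X}"
```

White (255, 255, 255) encodes as "FFFFFF", and three channels of 8 bits each give exactly 2^24 distinct codes.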
Digital images can be stored in many different formats. The simplest form is a bitmap. This simply contains the data for the color of each individual pixel and is organized in a way that indicates the location of the pixel. The major problem with this method is that it requires the storage and handling of vast quantities of data. This problem causes computers to have difficulty moving, working with, and saving the images. To circumvent this problem, computer scientists developed compressed formats for image storage. The development of new formats allowed large images to be stored in a much smaller amount of data. The most prevalent method of image compression is the JPEG format. Compression relies on grouping areas of similar pixel color and storing rules the computer can follow to reproduce a nearly identical image. The largest problem with this method is that it is only able to store an image with 95% accuracy. In addition, the quality can be reduced to further compress the image; heavier compression significantly reduces quality (Neelamani, 2006).
Image processing is a technique used by computer scientists to analyze images through the use of computers. This can be both difficult and processor intensive. Different means can be used to search for patterns in images. The computer can compare average color, it can look for large changes in color, it can break colors into components and analyze each component differently, and it can analyze images in any other way it is programmed. Image processing can be difficult to do, but if done correctly, it can be an extremely powerful tool (Sharma, 2010).
Image Processing in Mathematica

Wolfram Research has created Mathematica, a software tool that helps mathematicians do complex mathematics. In addition to performing math, it has been adapted so that it offers a programming language similar to several other programming options. Mathematica offers unique ways to handle and analyze images. For instance, it is very easy to import images; they can simply be dragged into the notebook file. Once they are in the document, there are numerous functions available, ranging from partitioning an image to grouping images by color schemes. ImagePartition, a function Wolfram Research has provided, takes an image and a partition size. Mathematica then returns a list of lists of smaller images. These smaller images are squares with a side length equal to the partition size specified. This can be very useful, allowing the image to be broken down into usable sections, helping simplify the image analysis process. This function, along with other functions that help with color analysis, allows a program to process images with relative ease.
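A rough Python equivalent of this partitioning behavior may clarify what the function returns: a grid (list of lists) of square tiles. This sketch assumes, as ImagePartition does by default, that edge pixels that do not fill a complete tile are dropped.

```python
def partition_image(pixels, size):
    """Split a 2-D grid of pixel values into size x size square tiles,
    mimicking the shape of Mathematica's ImagePartition output.
    Incomplete edge tiles are discarded."""
    rows = len(pixels) // size
    cols = len(pixels[0]) // size
    return [[[row[c * size:(c + 1) * size]
              for row in pixels[r * size:(r + 1) * size]]
             for c in range(cols)]
            for r in range(rows)]
```

Partitioning a 4x4 grid with size 2 yields a 2x2 grid of 2x2 tiles, each of which can then be analyzed as its own small image.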
Music and Sound

According to Sarah Rutkiewicz, an expert in music theory and musical education, music is organized according to pitch, rhythm, or harmony. This organization leads to an artistic element present in the arrangement that the performer can reproduce. Noise, by contrast, is random, meaningless, accidental, and lacks the essential artistic element. While not all noise is music, music can be considered a more specific type of noise (personal communication).
Research Plan

A. Researchable question or engineering problem being addressed

The goal of this project is to design a program that uses an image as input to generate music that is emotionally linked to the image.

B. Hypothesis/Goals

The goal of the research and development is to code a program that uses an image as input to create music that is emotionally linked to the image.
C. Description in detail of methods or procedures

Mathematica will be used to analyze inputted images and create music. The methods that will be used include finding the size, ratio of length to width, and average color value of the image; searching for areas with a large concentration of a color and searching for dramatic changes in hue; and looking for patterns regarding the number of times a certain color appears. Each one of these properties of the image will correspond to a property of music, such as the key signature, time signature, tempo, and instrumentation. The image will be partitioned into smaller sections that will correspond to measures of the music. The overall color value of these smaller images will be used to determine the chord of the measure. Finally, the smaller sections will be broken again into smaller sections that will each represent one thirty-second note. The brightness of these portions will be used to determine whether there is sound or silence during the time that the thirty-second note represents.
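The final brightness test above can be sketched as a simple threshold over a measure's sub-sections. The 0.5 cutoff and the per-measure list shape are illustrative assumptions, not the project's actual values.

```python
def rhythm_from_brightness(brightnesses, threshold=0.5):
    """Map each thirty-second-note slot's average brightness (in [0, 1])
    to sound or silence. The 0.5 threshold is an illustrative choice."""
    return ["note" if b >= threshold else "rest" for b in brightnesses]
```

A bright-dark-bright sequence of sub-sections would therefore produce a note, a rest, and a note in that slot order.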
Methodology

Mathematica 7.0.1.0 for Students from Wolfram Research was used to write a program that analyzes a digital image and generates audible music. An image was copied to the clipboard and pasted into the Mathematica file next to the variable called image, and the program was subsequently run. After the music was generated, the command Export["SoundFile.mid", Sound[data, time]] was run in Mathematica, which created a MIDI file on the C drive of the computer. This file was opened in Sibelius 6.1.0, which displayed the music as sheet music. Sibelius was used to print the sheet music.
Results and Discussion

The application resulting from this investigation begins with a path or a URL to an image. After importing the image, the overall valence and arousal of the graphic are determined using the average color and the average change in color from one pixel to the next. Using these values to determine the average emotion present in the piece, the program selects an appropriate key that correlates best with that emotion by calculating the closest predefined point on Thayer's Emotional Plane. The emotion that lies closest to the point representing the image being analyzed is used.
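The two measurements described above can be approximated in a few lines. This is an illustrative simplification of the actual Mathematica program: it operates on a grayscale image given as a 2-D list of values in [0, 1], proxies valence by average brightness, and proxies arousal by the average change between horizontally adjacent pixels.

```python
def image_valence_arousal(gray):
    """Estimate (valence, arousal) for a grayscale image.
    Valence: average brightness. Arousal: average absolute change
    from one pixel to its horizontal neighbor. Both formulas are
    illustrative stand-ins for the program's actual analysis."""
    flat = [p for row in gray for p in row]
    valence = sum(flat) / len(flat)
    diffs = [abs(row[i + 1] - row[i]) for row in gray for i in range(len(row) - 1)]
    arousal = sum(diffs) / len(diffs) if diffs else 0.0
    return valence, arousal
```

A checkerboard-like image of alternating black and white pixels scores mid valence but maximal arousal, while a uniform gray image scores zero arousal.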
After discerning the key, chord progressions are formed in a similar manner. The image is subdivided into numerous square partitions. Each partition is analyzed using the same process. The resulting vector on the Thayer Emotional Plane and the previously played chord are used to determine which chord is next in the progression; however, the program always ensures that the chord matches appropriately with the surrounding music by comparing it to the previous chord. This process is repeated for every section of the image until a lengthy progression of chords has been constructed.
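One way to sketch the "previous chord constrains the next chord" rule is a small successor table: the section's emotionally preferred chord is used only if the progression rules allow it after the previous chord, with a fallback otherwise. The table below is hypothetical, not the program's actual rules.

```python
# Hypothetical allowed-successor table over Roman-numeral chord degrees.
# An illustrative stand-in for the program's actual progression rules.
ALLOWED_NEXT = {
    "I":  ["IV", "V", "vi"],
    "IV": ["V", "I"],
    "V":  ["I", "vi"],
    "vi": ["IV", "V"],
}

def next_chord(previous, preferred):
    """Pick the chord for the next measure: take the section's preferred
    chord if the rules allow it after `previous`, else fall back to the
    first allowed successor."""
    allowed = ALLOWED_NEXT[previous]
    return preferred if preferred in allowed else allowed[0]
```

So a section whose emotion suggests V after a I chord gets V, but a suggestion of vi after IV is overridden because the table forbids it.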
Upon the completion of the chords, the sub-sections are reanalyzed in order to form a melody. This process looks at the valence and arousal values of even smaller subsections of pixels. This allows multiple melodic notes to be played with each chord. Before committing each note to the final sound, the note is checked against the rest of the notes being played at that time to ensure that the sound produced will be correctly in accordance with music theory. After both the chords and the melody are established, the program then compiles them into a single music file that can be exported as a MIDI file and opened as sheet music in other programs such as Sibelius.
Conclusions

The generated music gives indications of the contents of the original image. Images with lighter colors, such as yellow, light blue, light green, and other colors associated with joyful images will produce cheerful music. Similarly, darker colored images will generate somber music. The same correlation exists for arousal values; images with great variances in arousal will create music that makes drastic changes. Images with little variance will be represented with very similar notes that seem to roll throughout the entirety of the piece. While the correlation between the source and the resulting music is not strong enough to convey all information present in the image, it gives insight into the main themes present in the art.
Limitations and Assumptions

For the prototype software to function properly, it must be assumed that the valence of the emotions present in the image is based solely on the average color and that the arousal values are based solely on the magnitude of the variation in colors. The program will also only work well for Americans, because the version of Thayer's Emotional Plane the program uses is based on people living in the United States; other planes would be necessary for other locations around the world. In some images, the mood of the piece is not accurately portrayed by these characteristics. The program was designed to take any digital image as input and base the outputted music solely on that data. For this reason, the input is uncontrollable, and the only controllable aspect is the process the image undergoes.
Applications and Future Experiments

The work on image analysis is a breakthrough in inter-sensory interpretation, a technique that permits the perception of one sense to be perceived similarly through another sense. The program resulting from this project will provide the ability not only to visualize an image, but also to experience it aurally. The next step in the project is to add seventh chords, a style of chord currently unsupported by the program. This will allow the use of more complex chord progressions, which will eventually result in a more accurate aural portrayal of the image. To maximize the features of the program, additional instruments and volume levels would have to be added.
Literature Cited

Hu, X. (2010). Improving mood classification in music digital libraries by combining lyrics and audio. Association for Computing Machinery. Retrieved from http://xml.engineeringvillage2.org
Jun, S. (2010). Music retrieval and recommendation scheme based on varying mood sequences. International Journal on Semantic Web and Information Systems, 6(2). Retrieved from http://find.galegroup.com
Kuehni, R. G. (2005). Color: An introduction to practice and principles (2nd ed.). Hoboken, New Jersey: John Wiley & Sons, Inc.
Laurier, C., Meyers, O., Serrá, J., Blech, M., Herrera, P., & Serra, X. (2009). Indexing music by mood: Design and integration of an automatic content-based annotator. Multimedia Tools and Applications, 47(3). Retrieved from http://find.galegroup.com
Loy, G. D. (2006). Musimathics: A guided tour of the mathematics of music (Vol. 1). Cambridge, MA: The MIT Press.
Musical sound. (2010). In Encyclopædia Britannica. Retrieved October 20, 2010, from Encyclopædia Britannica Online: http://www.britannica.com/EBchecked/topic/399266/musical-sound/64497/Pitch-and-timbre?anchor=ref529625
Neelamani, R. (2006). JPEG compression history estimation for color images. IEEE Transactions on Image Processing, 15(6). Retrieved from http://ieeexplore.ieee.org
Olson, H. F. (1967). Music, physics, and engineering (2nd ed.). New York: Dover Publications, Inc.
Schaie, K. W. (1961). Scaling the association between colors and mood-tones. The American Journal of Psychology, 74, 226-273.
Sharma, G. (Ed.). (2002). Digital color imaging handbook. CRC Press.