put a happy tree there

gesture and voice-based painting game · spring 2017

done for

6.835 Intelligent Multimodal User Interfaces

using tools such as

JavaScript · HTML · CSS · Leap.js · Leap SDK · Leap Motion sensor · HTML5 Speech API

Put a Happy Little Tree There is a project that aims to explore new, magical ways of interaction in creating digital art. Inspired by Bob Ross's show, The Joy of Painting, this project allows users to create a landscape painting, with trees, mountains, and clouds, by simply speaking and pointing at their computer screen - no knowledge of painting or drawing required!

Check out the demo above to see my final version of the project. The app (as of now) is web-based and requires a microphone and a Leap Motion controller to interpret speech and hand gestures.


In Spring 2017, I took a class called "Intelligent Multimodal User Interfaces". The class gave us free rein on final projects, as long as they fell into the category of a "multimodal app": basically, some type of digital media that included unconventional inputs like gesture, voice, or stylus drawing. I decided I wanted to make an art-related one, using the inputs of gesture and voice control.

Specifically, my project aims to explore how to make creating art (which can often feel daunting) more enjoyable, accessible, and magical for a broader range of skills and abilities. Bob Ross's show was a large inspiration for the project, both visually, in that it scoped the app down to a landscape painting creator, and thematically: like Bob and his soothing voice, the project aims to break down the painting process, step by step, in an enjoyable way.

My system also alludes to "Put-That-There", an HCI project created at MIT around 1980, which synthesizes voice and gesture control within a graphics interface, allowing a user to create and manipulate basic colored shapes. My project aims to leverage the same mechanisms of "Put That There" into a more expressive, artistic application.

system design

basic functionality

"Put a Happy Little Tree There" lets users create a landscape composition, element by element, using the tools of a Leap Motion sensor and their browser's microphone. For example, if you pointed at a location in the canvas and said "paint a tree there", the system would respond accordingly. The application currently allows 3 types of elements to be painted: mountains, trees, and clouds.

To get users acquainted with the mechanism of using gesture and speech to create elements in this manner, the system starts off with a tutorial mode that instructs the user how to paint a mountain on the canvas.

Elements such as clouds and trees also scale depending on where they are placed on the canvas, creating the illusion of depth in the world of the painting.
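The depth effect can be sketched as a simple mapping from an element's vertical position to a scale factor. The horizon line and scale range below are made-up values for illustration, not the constants the app actually uses:

```javascript
// Hypothetical sketch of the depth-scaling idea: elements placed higher on
// the canvas (closer to the horizon) render smaller than elements placed
// near the bottom edge.
const HORIZON_Y = 0.35;  // fraction of canvas height where the horizon sits
const MIN_SCALE = 0.4;   // size of an element right at the horizon
const MAX_SCALE = 1.0;   // size of an element at the bottom edge

function depthScale(y, canvasHeight) {
  // Normalize vertical position to [0, 1] between horizon and bottom edge.
  const raw = (y / canvasHeight - HORIZON_Y) / (1 - HORIZON_Y);
  const t = Math.min(Math.max(raw, 0), 1);
  return MIN_SCALE + t * (MAX_SCALE - MIN_SCALE);
}
```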

multimodal color palette

A palette of 9 different colors gives the user some more creative flexibility. The color of the brush can be changed gesturally, by hovering over the palette, or verbally, by dictating the color name.
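The two paths to the same brush state can be sketched as a pair of resolvers. The palette layout (a horizontal strip of 9 swatches) and the specific color list are assumptions for illustration:

```javascript
// Hypothetical 9-color palette; the real app's colors may differ.
const PALETTE = ["red", "orange", "yellow", "green", "blue",
                 "purple", "brown", "black", "white"];

// Gestural path: map a hover x-coordinate over the palette strip to a swatch.
function colorFromHover(x, paletteWidth) {
  const swatchWidth = paletteWidth / PALETTE.length;
  const index = Math.min(Math.floor(x / swatchWidth), PALETTE.length - 1);
  return PALETTE[index];
}

// Verbal path: pick out a color name anywhere in the recognized phrase.
function colorFromSpeech(transcript) {
  const words = transcript.toLowerCase().split(/\s+/);
  return PALETTE.find(c => words.includes(c)) || null;
}
```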

manipulating elements

If the user makes a mistake or changes their mind about something they've painted, they can also delete and move items that they've already put on the canvas.


It was critical to make a "Help and Hints" screen accessible, summarizing the mechanisms of the app covered in the tutorial along with tools not yet explained.

extra lil features

Since a large facet of what makes the app unique is how it tries to make creative tools more magical, my project includes some small playful design choices to add to the magic. When a user paints an element, sound effects of paint brushes play in the background. When a user paints a mountain, the system plays a clip of Bob Ross saying "happy little mountain". Also, to reference Bob Ross' famous quote "There are no mistakes, just happy accidents", the system paints a randomly placed cloud or tree on the canvas when a user says "mistake", along with playing an audio clip of that quote.


architecture overview

The system uses a Leap Motion Controller to detect where a user points on the screen, and the laptop microphone to listen for verbal commands. Gesture and speech inputs are interpreted with help from the JavaScript version of the Leap Motion SDK and the HTML5 Web Speech API, then used to determine whether the user meant a specific art command. The results of the command are reflected in a web-based visualization.

hand position

The Leap Motion SDK is really good at giving specific data about the positions and orientations of each finger on the hand. In my case, I check whether the Leap identifies a Pointable object, which represents a finger pointing at the screen. The position data of this pointing finger tells my system where the user intends to place an element on the canvas.
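One way to sketch the mapping from a pointable's tip position (reported in millimeters, with the origin at the sensor) to canvas pixels is a manual normalization. The interaction ranges below are rough guesses for illustration, not values from the project; leapjs also provides `frame.interactionBox.normalizePoint()` for exactly this purpose:

```javascript
// Assumed usable interaction ranges above the sensor, in millimeters.
const RANGE = { xMin: -150, xMax: 150, yMin: 100, yMax: 350 };

function tipToCanvas(tipPosition, canvasWidth, canvasHeight) {
  const [x, y] = tipPosition;  // ignore z (depth toward the user) here
  const clamp = v => Math.min(Math.max(v, 0), 1);
  const nx = clamp((x - RANGE.xMin) / (RANGE.xMax - RANGE.xMin));
  const ny = clamp((y - RANGE.yMin) / (RANGE.yMax - RANGE.yMin));
  return {
    x: nx * canvasWidth,
    y: (1 - ny) * canvasHeight,  // Leap y grows upward; screen y grows downward
  };
}
```

With the real SDK this would sit inside a `Leap.loop(frame => ...)` callback, reading something like `frame.pointables[0].tipPosition`.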

speech recognition

The HTML5 Web Speech API converts audio signals into the most probable words that were said. My system checks whether keywords, such as "mountain", "tree", and "cloud", are recognized, and then updates the visualization accordingly.
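The wiring between the recognizer and the keyword check can be sketched as below. The keyword list comes from the write-up; the handler body is a hypothetical hook, and the recognition setup only does anything in a browser that supports the API:

```javascript
const KEYWORDS = ["mountain", "tree", "cloud"];

// Pure keyword check, separated out so it can be tested without a browser.
function findKeyword(transcript) {
  const words = transcript.toLowerCase().split(/\s+/);
  return KEYWORDS.find(k => words.includes(k)) || null;
}

// Browser-only wiring for the (webkit-prefixed) Web Speech API.
if (typeof window !== "undefined" && "webkitSpeechRecognition" in window) {
  const recognition = new window.webkitSpeechRecognition();
  recognition.continuous = true;      // keep listening across phrases
  recognition.interimResults = true;  // react before the phrase is final
  recognition.onresult = (event) => {
    const latest = event.results[event.results.length - 1];
    const keyword = findKeyword(latest[0].transcript);
    if (keyword) {
      console.log("recognized command for:", keyword);  // hypothetical hook
    }
  };
  recognition.start();
}
```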

Since the API is often very inaccurate (e.g., the word "there" would often be recognized as "bear"), I gave strong leeway to the keywords my system recognizes, adding to the keyword collection words that sound similar to important ones like "mountain", "tree", and "cloud".
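This leeway can be sketched as an alias table that maps sound-alikes back to the intended keyword. Only the "there" → "bear" confusion comes from the write-up; the other aliases are illustrative guesses:

```javascript
// Each keyword carries words the recognizer tends to return instead of it.
const ALIASES = {
  there:    ["there", "their", "bear"],
  tree:     ["tree", "three", "treat"],
  mountain: ["mountain", "fountain"],
  cloud:    ["cloud", "crowd", "loud"],
};

// Map a recognized word back to the keyword it most likely stands for.
function canonicalKeyword(word) {
  const w = word.toLowerCase();
  for (const [keyword, variants] of Object.entries(ALIASES)) {
    if (variants.includes(w)) return keyword;
  }
  return null;
}
```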

combining modalities

My model combines the speech output with the ongoing stream of hand positions to track when and where users say certain commands.

The basic model for combining speech and gesture input relies on collecting a buffer of the most recent pointer-finger positions and their corresponding timestamps, as returned by the Hand object in the Leap SDK. On the speech recognition side, the system collects the list of words in the current phrase and estimates each word's time of utterance.

In the case of someone pointing at the canvas and saying "paint a tree there", the system notices that the keyword "there" was said, determines the timestamp associated with "there", and searches for the finger position with the closest timestamp. "There" and "here" are special words that trigger the system to note the cursor position at the time of dictation and place elements at that location. "This" selects elements to be moved, and color names such as "red" and "blue" trigger the brush color to change.
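The fusion step described above can be sketched as a nearest-timestamp lookup over the pointer buffer. The field names here are assumptions, not the project's actual data structures:

```javascript
// samples: [{ t: msTimestamp, x, y }, ...], most recent last.
// spokenAt: estimated utterance time of a trigger word like "there".
// Returns the sample whose timestamp is closest to spokenAt.
function positionAtTime(samples, spokenAt) {
  let best = null;
  let bestDelta = Infinity;
  for (const s of samples) {
    const delta = Math.abs(s.t - spokenAt);
    if (delta < bestDelta) {
      bestDelta = delta;
      best = s;
    }
  }
  return best;
}
```

For "paint a tree there", the system would look up the timestamp of "there" and call something like `positionAtTime(pointerBuffer, tThere)` to decide where the tree goes.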

output and visualization

I created assets for the mountain, tree, and cloud either by photoshopping them out of Bob Ross paintings or by painting them myself in Photoshop. They exist as DOM elements within a containing block in the HTML. Painting a new element adds a new node to the DOM, positioned at the user's cursor location at the time of the command. Colors change by modifying the element's CSS.
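Turning a placement command into styles for a new node can be sketched as below. The use of absolute positioning and the asset path in the comment are assumptions for illustration:

```javascript
// Build the CSS for a newly painted element from the cursor position at
// command time and a depth-based scale factor.
function elementStyle(cursor, scale) {
  return {
    position: "absolute",
    left: `${Math.round(cursor.x)}px`,
    top: `${Math.round(cursor.y)}px`,
    transform: `scale(${scale})`,
  };
}

// In the browser, this would be applied to a freshly created node, e.g.:
//   const img = document.createElement("img");
//   img.src = "assets/tree.png";  // hypothetical asset path
//   Object.assign(img.style, elementStyle(cursor, scale));
//   container.appendChild(img);
```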

user testing

To see how the app would fare in the real world (a.k.a. with users who are not experts in the system like myself), I conducted user tests of a prototype with 5 people, aiming to see how natural the process of adding and manipulating elements in the painting was, how learnable the system was, and how enjoyable it was.

Overall, I received the impression that users found the application enjoyable, fun, and pleasing to use. While there were clear issues in usability, the user tests gave me key insights on what kinds of commands and interactions felt intuitive to users.

Some highlights include:

  • Users thought in general that my painting application was enjoyable!
  • My observations also told me that users struggled to move elements from one position to another. This wasn't so surprising: my method for matching the intended location of an object to the time at which its command was dictated has a significant error margin. I think more work needs to be targeted at this in the future.
  • However, my user tests also indicated that giving users multiple ways of doing things (i.e. changing colors by dictation versus brush) was helpful, shown by how users were inclined to do whatever they felt was most natural.

(read more about user testing in my paper!!)


next steps

While my system is functional and runnable, there is a lot of work I'd like to do before making it live and sharing it with an audience. Many people don't have a Leap lying around, so I would want to look into other ways of approximating hand motion, such as using a JavaScript library for hand recognition based on a webcam feed, or even just falling back to mouse control.

I'd also like to work on making more varied and polished graphics in order to help make it feel like users are truly creating a beautiful end product with the application.

Finally, in terms of performance, I think there's a lot that can be done to improve the speech and gesture model so that it recognizes commands more accurately. Perhaps this can be done by smoothing the hand position data and by running user tests focused specifically on the accuracy of the system.


One big thing that I realized through this project is that not only is designing for unconventional modalities difficult, but instructing people to learn the unconventional modalities is a difficult problem in and of itself. It takes a lot of careful thought on what words, illustrations, and affordances to use in order for people to more comfortably and naturally learn, and often these design decisions can be very subtle.

Another lesson that I've learned is that magical, playful touches can go a long way in terms of approaching user experience from an enjoyability aspect. While it definitely goes hand in hand with usability and system performance, the ability to create a system that is memorable and emotionally engaging, and the decisions that help contribute to a system being those things, is an area I'd like to pursue more deeply in the future.

related work