If you make a very short sound in a reverberant room, like a single click or clap, you will hear the room respond with reflected copies of the original sound. The response typically starts with a few clearly separable reflections from the nearest walls, but as time passes, numerous reflections upon reflections merge into a diffuse, late reverberation that finally dies out. This pattern of reflections characterizes the room and the listening position, and it is what we call the room response or, if the source sound is an infinitely short impulse, the impulse response.
The audible outcome of producing a sound in that given room will be a combination of time-shifted, weighted copies of the source sound following the pattern described by the impulse response. The process of combining time-shifted copies in that way is called convolution. The source sound is convolved with the impulse response of the room, and the result is a sound showing traits of both the source and the room. The convolution process works both ways (is commutative), which means that singing in a room can be described either as convolving the voice with the room or as convolving the room with the voice.
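The idea of summing time-shifted, weighted copies can be sketched in a few lines of Python. This is only a minimal illustration, not an efficient implementation: the function name `convolve` and the toy sample values in `source` and `room` are assumptions made up for this example, and a real impulse response would contain many thousands of samples.

```python
def convolve(source, impulse_response):
    """Direct convolution: add one delayed, weighted copy of `source`
    to the output for every sample of `impulse_response`."""
    if not source or not impulse_response:
        return []
    out = [0.0] * (len(source) + len(impulse_response) - 1)
    for delay, weight in enumerate(impulse_response):
        # Each impulse response sample contributes a copy of the source,
        # shifted by `delay` samples and scaled by `weight`.
        for n, sample in enumerate(source):
            out[n + delay] += weight * sample
    return out


# Hypothetical toy data: a short decaying "sound" and a "room" with a
# direct path (weight 1.0) and two later, weaker reflections.
source = [1.0, 0.5, 0.25]
room = [1.0, 0.0, 0.5, 0.25]

wet = convolve(source, room)
print(wet)                    # [1.0, 0.5, 0.75, 0.5, 0.25, 0.0625]
print(convolve(room, source)) # same values: convolution is commutative
```

The output is longer than the source by the length of the impulse response minus one sample, which corresponds to the reverberation tail that continues after the source sound has stopped.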
Reverberation is probably the best-known example of convolution in audio processing, but convolution is in fact a quite common mechanism. Running a sound through a simple FIR (Finite Impulse Response) filter is nothing but convolution. Convolution is also found in spatialization, in audio morphing, and in physical modelling, where an excitation is convolved with a resonance.
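The FIR case can be illustrated with a small Python sketch. The function name `fir_filter` and the two-point averaging coefficients are assumptions chosen for this example only. Each output sample is a weighted sum of the current and previous input samples, which is the same sum of shifted, weighted copies seen from the point of view of the output. Feeding the filter a unit impulse returns its coefficients, which is why they are the filter's impulse response.

```python
def fir_filter(signal, coeffs):
    """y[n] = sum over k of coeffs[k] * signal[n - k].
    Samples before the start of the signal are treated as zero, and the
    output is cut to the input length (the tail is discarded)."""
    out = []
    for n in range(len(signal)):
        acc = 0.0
        for k, c in enumerate(coeffs):
            if n - k >= 0:
                acc += c * signal[n - k]
        out.append(acc)
    return out


# Hypothetical two-point averaging filter (a very gentle low-pass).
coeffs = [0.5, 0.5]

impulse = [1.0, 0.0, 0.0, 0.0]
step = [1.0, 1.0, 1.0, 1.0]

impulse_out = fir_filter(impulse, coeffs)
step_out = fir_filter(step, coeffs)

print(impulse_out)  # [0.5, 0.5, 0.0, 0.0] -> the coefficients themselves
print(step_out)     # [0.5, 1.0, 1.0, 1.0] -> the sudden onset is smoothed
```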