0203. Introduction to Artificial Intelligence

Perception

 

1. Introduction

For a system, perception is usually defined as the process of acquiring, interpreting, selecting, and organizing sensory information.

Perception presumes sensation, in which each type of sensor reacts only to a certain type of simple signal. Putting the sensations together and making sense of them is the job of the perception system.

Perception can be seen as a special type of categorization (or classification, pattern recognition) where raw material comes into the system as analogue signals, and the results are categorical judgments and conceptual relations.

Strictly speaking, we never "see things as they are": the perception process of an intelligent system is often (and should be) influenced by internal and external factors besides the signals themselves. Furthermore, perception is not a purely passive process driven by the input.

In AI, the study of perception mostly focuses on reproducing human perception, especially the perception of aural and visual signals. However, this need not be the case, since the perception mechanism of a computer system does not have to be identical to that of a human being.

 

2. Speech

Automatic Speech Recognition (ASR) is the front end of a system that can perceive and understand spoken language, as shown in the speech-to-speech translator demonstration.

There are several approaches toward automatic speech recognition:

The acoustic-phonetic approach postulates that there exist finite, distinctive phonetic units (phonemes) in spoken language and that these units are broadly characterized by a set of acoustic properties. Even though the acoustic properties of phonetic units are highly variable, both across speakers and with neighboring sounds, the acoustic-phonetic approach assumes that the rules governing the variability are straightforward and can be readily learned.

The pattern-matching approach involves two essential steps, pattern training and pattern comparison. A speech-pattern representation can be in the form of a speech template or a statistical model. In the pattern-comparison stage, the unknown speech (the speech to be recognized) is compared directly with each possible pattern learned in the training stage in order to determine the identity of the unknown. The pattern-matching approach has become the predominant method of speech recognition in the last decade. Example: Training Neural Networks for Speech Recognition.
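
As a rough illustration of the template form of pattern matching, the sketch below stores one template per word ("training") and labels an unknown utterance with the closest template ("comparison"). Dynamic time warping (DTW) is used as the distance measure because it tolerates differences in speaking rate; that choice, the function names, and the one-dimensional feature values are all assumptions made for this example, not part of the notes above.

```python
def dtw_distance(a, b):
    """Dynamic time warping distance between two sequences of numbers."""
    inf = float("inf")
    n, m = len(a), len(b)
    # cost[i][j] = cheapest alignment of the first i items of a with the first j of b
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j],      # a advances, b repeats
                                 cost[i][j - 1],      # b advances, a repeats
                                 cost[i - 1][j - 1])  # both advance
    return cost[n][m]


def recognize(unknown, templates):
    """Return the label of the stored template closest to the unknown pattern."""
    return min(templates, key=lambda label: dtw_distance(unknown, templates[label]))


# Hypothetical one-dimensional "feature tracks", made up for illustration only.
templates = {
    "yes": [1.0, 3.0, 4.0, 3.0, 1.0],
    "no": [4.0, 2.0, 1.0, 2.0, 4.0],
}
# Same shape as the "yes" template, but spoken more slowly (some frames repeat).
unknown = [1.0, 1.0, 3.0, 4.0, 4.0, 3.0, 1.0]
result = recognize(unknown, templates)
print(result)
```

A real recognizer would compare sequences of multi-dimensional acoustic feature vectors rather than single numbers, and a statistical model would replace the stored templates.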

The artificial intelligence approach attempts to do speech recognition in knowledge-based systems according to the way a person applies intelligence to this type of task.

For the most part, machines have been successful in recognizing carefully articulated and read speech. Spontaneous human conversation has proven to be a much more difficult task.

A related topic is Speech Synthesis (though it is not input/perception, but output/action). This is the translation from text to speech. The text analysis capabilities of the system detect the ends of sentences, perform some rudimentary syntactic analysis, expand digit sequences into words, and disambiguate and expand abbreviations into normally spelled words which can then be analyzed by the dictionary-based pronunciation module. The pronunciation module provides pronunciations for most ordinary words, and morphological derivatives thereof, as well as proper names; default strategies exist for pronouncing words not recognized by the dictionary-based methods. Other components handle prosodic phrasing, word accentuation, sentence intonation, and the actual speech synthesis.
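
Two of the text-analysis steps above, expanding digit sequences and expanding abbreviations, can be sketched as follows. This is a minimal sketch under strong assumptions: the abbreviation table is a hypothetical three-entry dictionary with no disambiguation, and digits are read out one by one rather than as full numbers.

```python
import re

# Hypothetical, deliberately tiny abbreviation table. A real system needs context
# to disambiguate (for example "Dr." as "Doctor" or as "Drive").
ABBREVIATIONS = {"Dr.": "Doctor", "St.": "Street", "etc.": "et cetera"}

DIGIT_WORDS = ["zero", "one", "two", "three", "four",
               "five", "six", "seven", "eight", "nine"]


def expand_digits(match):
    """Spell out a digit sequence digit by digit, e.g. '42' -> 'four two'."""
    return " ".join(DIGIT_WORDS[int(ch)] for ch in match.group())


def normalize(text):
    """Expand known abbreviations and digit sequences into plain words."""
    words = [ABBREVIATIONS.get(word, word) for word in text.split()]
    return re.sub(r"[0-9]+", expand_digits, " ".join(words))


print(normalize("Dr. Smith lives at 221 Baker St."))
```

The output of such a step is what the dictionary-based pronunciation module would receive. Note that the final "St." here is both an abbreviation and the end of the sentence, which is one reason sentence-end detection is a separate problem.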

Remaining problems: naturalness, especially context- and meaning-related adjustments (emotion, stress, tone, ...).

Music perception and composition are also studied in AI.

 

3. Vision

Vision begins with a large array of measurements of the light reflected from object surfaces onto the eye. Analysis then proceeds in multiple stages, each producing increasingly useful representations of the information in the scene.

Computational studies suggest three primary representational stages.

  1. Early representations may capture information such as the location, contrast, and sharpness of significant intensity changes or edges in the image. Such changes correspond to physical features such as object boundaries, texture contours, markings on object surfaces, shadow boundaries, and highlights. In the case of a dynamically changing scene, the early representations may also describe the direction and speed of movement of image intensity changes.
  2. Intermediate representations describe information about the three-dimensional (3-D) shape of object surfaces from the perspective of the viewer, such as the orientation of small surface regions or the distance to surface points from the eye. Such representations may also describe the motion of surface features in three dimensions.
  3. Higher-level representations of objects describe their 3-D shape, form, and orientation relative to a coordinate frame based on the objects or on a fixed location in the world. Tasks such as object recognition, object manipulation, and navigation may operate from the intermediate or higher-level representations of the 3-D layout of objects in the world.
On the other hand, later stages also influence earlier ones. The knowledge, intentions, and context of the system also affect the final result of vision.
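
The early representation in stage 1 can be illustrated on a single image row: each significant intensity change is recorded with its location and contrast. This is a minimal sketch, not a full edge detector; the pixel values, the threshold, and the function name are assumptions chosen for the example.

```python
def find_edges(row, threshold):
    """Return (location, contrast) pairs for significant intensity changes in one image row."""
    edges = []
    for x in range(len(row) - 1):
        contrast = row[x + 1] - row[x]  # signed intensity step between neighboring pixels
        if abs(contrast) >= threshold:
            edges.append((x, contrast))
    return edges


# Hypothetical row of pixel intensities: dark background, bright object, dark again.
row = [10, 12, 11, 80, 82, 81, 15, 14]
edges = find_edges(row, threshold=30)
print(edges)
```

The positive contrast marks the transition from background to object and the negative one the transition back, while the small fluctuations inside each region fall below the threshold and are ignored.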

For relatively simple pattern recognition problems, a neural network is often used to map input directly to output through a learning process.
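
As a minimal sketch of such a learned input-to-output mapping, the code below trains a single perceptron, the simplest kind of neural network, on a toy pattern. The two-pixel task (respond only when both pixels are lit), the learning rate, and the number of epochs are assumptions made for illustration.

```python
def predict(weights, bias, inputs):
    """Output 1 if the weighted sum of the inputs exceeds zero, else 0."""
    activation = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1 if activation > 0 else 0


def train_perceptron(samples, epochs=20, rate=1.0):
    """Learn weights and a bias with the perceptron rule from (inputs, label) pairs."""
    weights = [0.0] * len(samples[0][0])
    bias = 0.0
    for _ in range(epochs):
        for inputs, label in samples:
            error = label - predict(weights, bias, inputs)
            # Move the weights toward the inputs when the output was too low,
            # away from them when it was too high; no change when it was right.
            weights = [w + rate * error * x for w, x in zip(weights, inputs)]
            bias += rate * error
    return weights, bias


# Hypothetical two-pixel patterns; the label is 1 only when both pixels are lit.
samples = [([0, 0], 0), ([0, 1], 0), ([1, 0], 0), ([1, 1], 1)]
weights, bias = train_perceptron(samples)
for inputs, label in samples:
    print(inputs, "->", predict(weights, bias, inputs), "expected", label)
```

A single perceptron can only learn patterns that are linearly separable, which is one reason this direct approach suits relatively simple problems.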

Vision is not a pure input process. Eye movement has an important impact on visual perception. An active vision system is one that is able to interact with its environment by altering its viewpoint rather than passively observing it, and by operating on sequences of images rather than on a single frame. There has also been some study of using the eye gaze of a computer user in the interface to help control an application.

 

4. High-level perception

By "high-level perception", we mean how the given input data is categorized. In low-level perception, the processing is mostly "bottom-up", i.e., the output is more or less a function of the input, whereas in high-level perception many more factors are involved.
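
One way to picture the extra factors is to let context change how the same input is categorized. The sketch below combines bottom-up evidence with a context-dependent expectation using Bayes' rule; this probabilistic framing, the ambiguous glyph, and all the numbers are assumptions made for illustration, not a model taken from these notes.

```python
def perceive(likelihood, prior):
    """Combine bottom-up evidence with a top-down expectation (Bayes' rule)."""
    scores = {label: likelihood[label] * prior[label] for label in likelihood}
    total = sum(scores.values())
    posterior = {label: score / total for label, score in scores.items()}
    best = max(posterior, key=posterior.get)
    return best, posterior


# Hypothetical ambiguous glyph that looks equally like the letter B and the number 13.
likelihood = {"B": 0.5, "13": 0.5}

# Hypothetical expectations set by the surrounding context.
among_letters = {"B": 0.9, "13": 0.1}
among_digits = {"B": 0.1, "13": 0.9}

print(perceive(likelihood, among_letters)[0])
print(perceive(likelihood, among_digits)[0])
```

The input is identical in both calls; only the expectation differs, and with it the resulting category.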

"One of the most important properties of high-level perception is that it is extremely flexible. A given set of input data may be perceived in a number of different ways, depending on the context and the state of the perceiver. Due to this flexibility, it is a mistake to regard perception as a process that associates a fixed representation with a particular situation. Both contextual factors and top-down cognitive influences make the process far less rigid than this."

Copycat is an AI system that works on analogy problems. Section 7.3.2 of the textbook contains a brief description of the system.