Abstract. Since the mid-20th century, EVP (Electronic Voice Phenomena) has been the focus of countless debates. Among them is the interpretation of what is truly being spoken within each recording. Since most EVP recordings are low in quality, phonetic analysis is often difficult, and most researchers therefore rely on their hearing and audible interpretations to determine the words contained in each file. The purpose of this experiment is to establish the accuracy percentage at which human hearing can identify spoken words in random statements contained in low-quality recordings. To perform this experiment we created twenty simulated EVP recordings, each with background noise and vocal styles (normal speech, whispers, mumbles, etc.) similar to those found in purported anomalous recordings. The recordings were created in various environments by three N.E.C.A.P.S. staff members (C. Wong, B. Hantzis, M. Baker) and presented in two separate online surveys, with each recording displayed independently. The volunteers then listened to the recordings and reported the words (if any) they felt were contained in each file. The results (from 123 respondents for survey 1A and 108 for survey 1B) were downloaded and analysed for grading accuracy and to establish perception patterns. Our findings show that none of the volunteers scored above 80% accuracy for survey 1A or above 50% for survey 1B. The average accuracy was 49% for survey 1A and 28% for survey 1B. The results of this experiment indicate that human perception is not an accurate methodology for determining non-contextual spoken words contained in an EVP recording. Inaccurate interpretations appear to be due to neurological and psychological obstacles such as biases, anticipation and pareidolia. These obstacles greatly affect the comprehension and/or objectivity of the listener’s perspective.

Keywords: EVP, Perceptions, Communications, Anomalous Phenomena

Biographical Note(s):
Michael Baker has been conducting research into Fortean claims for over 10 years. He has created numerous devices and research methods for this purpose. His professional expertise is primarily in bio-medical electronics engineering, complex data analysis and software development. He is a graduate of WTI – Boston, 1996.


There is an ongoing debate within the paranormal research world surrounding the efficacy of a researcher’s ability to comprehend the words spoken in alleged EVP recordings. The effects of subliminal influences such as pareidolia, apophenia and confirmation bias make objective identification of speech difficult. Very often a personal perspective, in conjunction with the aforementioned psychological obstructions, tends to overshadow the logical analysis and comprehension of audible research results.

Under these conditions, objective research without the aid of statistical or non-biased methods may not be possible, as many researchers have reported difficulty decoupling the logical observational process from the apparent evidence presented by their own faculties. Often the solution to this mystification process is to remove personal observation and opinion from the research methodology and leave much of the results to statistical data gathered under strict controls. It should be noted that the research contained in this paper does not serve to confirm or deny the concept of existential beings, nor does it serve to establish an explanation for the existence of EV Phenomena. Instead, this paper examines the accuracy level of human perception as it pertains to understanding alleged anomalous linguistic communication contained within EVP recordings.


Since naturally occurring EVP recordings are alleged to be from currently unknown sources, the identification of their content also remains enigmatic and subject to conjecture. Therefore, in order to establish a reliable control, we found it necessary to create simulated recordings with background noise and speech patterns similar to those found in natural EVP. Since the recordings are of our own design, the identity of their content would not be in question, thus allowing us to accurately test the efficacy of human auditory perception.

We created a total of twenty simulated EVP recordings with varying background noise levels and complexity. The background noise ranged from -111.9 dB to -84.2 dB and contained common elements such as silence, wind, running water and movement of objects within the recording environment. The speech levels varied from whispers to average volume (-25.7 dB) and the speed of each word or statement varied from slightly rushed to mildly lethargic. The purpose of these variations was to help us identify which conditions were optimal for achieving the greatest accuracy percentage. The words spoken in each recording were chosen by the three N.E.C.A.P.S. staff members responsible for recording the samples. The content, while not specifically dictated, varies from common statements and words to unusual phrases and brand names. Several files include gibberish or non-speech. The purpose of the word variation was to understand the possible presence of a pareidolic or bias-like effect. If the volunteers listening to the files claimed to identify words in the “non-speech” statements, we could possibly attribute those responses to some variant of mental influence (pareidolia) or perhaps even some level of anticipation or confirmation/situational bias.
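The level figures quoted above imply a margin between the loudest speech and the background noise. The short Python sketch below works that arithmetic out. It assumes that all three figures were measured against the same reference level, which the text does not state, and the function name is hypothetical. The margin applies only to the average-volume speech; whispered samples would sit closer to the noise.

```python
def level_difference_db(speech_db, noise_db):
    """Difference between two levels that are already expressed in dB."""
    return round(speech_db - noise_db, 1)

# Levels quoted in the text (assumed to share one reference level).
SPEECH_DB = -25.7           # average-volume speech
NOISE_QUIETEST_DB = -111.9  # quietest background noise
NOISE_LOUDEST_DB = -84.2    # loudest background noise

widest_margin = level_difference_db(SPEECH_DB, NOISE_QUIETEST_DB)
narrowest_margin = level_difference_db(SPEECH_DB, NOISE_LOUDEST_DB)
print(widest_margin, narrowest_margin)
```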

The twenty files were divided into two surveys containing ten questions each and presented to volunteers via the internet from January 1, 2014 through December 31, 2014. Each audio file was presented individually in random order, and volunteers were instructed to listen to each recording and report what they heard or, in some cases, did not hear. If they deemed the file “unidentifiable”, volunteers were instructed to enter the phrase “I don’t know.” or the word “Nothing”. All other responses were to be written verbatim without punctuation in an open text box. Multiple choice answers were not used. Misspelled words and erroneous capitalizations did not affect the final tally of the surveys, and each response was double checked visually to ensure the statistical results of each survey were accurate.
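The grading rules described above can be sketched in Python. This is only an illustration, not the procedure actually used, which relied on a visual check of every response. The function names are hypothetical. The treatment of gibberish files (only an “unidentifiable” reply counts as correct) is an assumption, and misspellings, which the study tolerated, are not handled here.

```python
import string

# Replies volunteers were told to enter for an unidentifiable file,
# stored in normalized form (lowercase, no punctuation).
UNIDENTIFIED = {"i dont know", "nothing"}

def normalize(text):
    """Lowercase, drop punctuation (straight and curly), collapse spaces."""
    drop = string.punctuation + "’‘“”"
    cleaned = text.lower().translate(str.maketrans("", "", drop))
    return " ".join(cleaned.split())

def grade(response, expected):
    """Return True when a response matches the recorded phrase.

    `expected` is None for a gibberish file; treating only an
    "unidentifiable" reply as correct there is an assumption.
    """
    answer = normalize(response)
    if expected is None:
        return answer in UNIDENTIFIED
    return answer == normalize(expected)

def accuracy(responses, answer_key):
    """Percentage of one respondent's answers graded as correct."""
    correct = sum(grade(r, k) for r, k in zip(responses, answer_key))
    return 100.0 * correct / len(answer_key)
```

A tolerant comparison (for example, an edit-distance threshold) would be needed to reproduce the study's handling of misspelled responses.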


Survey 1A
1 It’s Raining
2 It’s Hot In Here
3 Dog
4 Gibberish
5 Gibberish
6 Gibberish
7 Hi
8 Can’t Breathe
9 Lobster
10 Tostitos

Survey 1B
1 String Theory
2 Kill Zak Bagans
3 Lens Cleaner
4 Cartwheel
5 Stand Off
6 Get in my Belly
7 Fraud
8 You want fries with that?
9 Soft Kitty Warm Kitty
10 Spiderman


The results show the accuracy rates produced by the 123 respondents to Survey 1A and the 108 respondents to Survey 1B.


The statistical results of our surveys appear to follow a pattern. In Survey 1A, the most accurate non-gibberish responses were to common statements or words; conversely, the least accurate responses were to more obscure or unknown words and phrases. This may suggest that listeners more easily interpret phrases or statements to which they have had more personal exposure. To investigate this hypothesis we conducted a subsequent set of three surveys (with 245 participants in total) that were presented independently on six different social media websites, public forums and blogs. Volunteers were asked to arrange the words and phrases used in Survey 1A and Survey 1B independently, as well as those of both Surveys 1A & 1B collectively, in order from what they considered the most common words and phrases to the least common. The surveys were conducted for several days and the results are as follows (Fig. 5, Fig. 6 & Fig. 7):

As seen in the table above (Fig. 5), words and phrases such as “Hi”, “Dog” and “It’s Raining” were considered to be the most common words and phrases presented in Survey 1A, while “Lobster” and “Tostitos” were consistently at the bottom of the list as least common. In Survey 1B (Fig. 6), “Stand Off”, “Spiderman” and “Cartwheel” topped the list as most common, while words and phrases such as “Lens Cleaner”, “Get in my Belly” and “Kill Zak Bagans” were considered least common. When viewing the combined lists from both Surveys 1A & 1B (Fig. 7), “Hi”, “Dog” and “It’s Raining” remained the most common. “Stand Off”, “Cartwheel” and “Spiderman” were presented as less common, and “Lobster”, “Tostitos”, “Kill Zak Bagans”, “Lens Cleaner”, and “Get in my Belly” remained at or near the bottom of the list as least common.

The results of the secondary surveys appear to correlate with the results of audio surveys 1A & 1B. This proposed correlation is supported by the statistical positions or ranks of each word in each survey. For example, “It’s Raining”, “Dog” and “Hi” were not only the easiest to identify in survey 1A, but they were also considered three of the most common words in daily communication according to the results of the secondary surveys. This data suggests that the words and phrases that were more easily identifiable in our experiment were also the words and phrases considered to be the most common in daily communication, thus lending support to the hypothetical inference that the interpretation of words contained in non-contextual, recorded, audio-only utterances is subject to some level of bias, anticipation or default perception.
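One way to put a number on the rank agreement described above is Spearman's rank correlation, which compares the ordering of two lists rather than their raw values. The Python sketch below is an illustration only: the paper does not say that this statistic was computed, the function names are hypothetical, and the sample inputs are placeholders rather than survey results.

```python
def ranks(values):
    """Rank values from 1 (smallest) upward, averaging ranks across ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result

def spearman(xs, ys):
    """Spearman's rho: the Pearson correlation of the two rank lists.

    Raises ZeroDivisionError if either list is constant.
    """
    rx, ry = ranks(xs), ranks(ys)
    n = len(rx)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5

# Placeholder inputs only, NOT the survey results.
commonness_rank = [1, 2, 3, 4, 5]              # 1 = judged most common
accuracy_pct = [90.0, 70.0, 75.0, 30.0, 10.0]  # made-up accuracy values
rho = spearman(commonness_rank, accuracy_pct)
```

With rank 1 assigned to the most common phrase, a value of rho near -1 would indicate that phrases judged more common were identified more accurately.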

The process of speech perception is greatly aided by several linguistic elements. Among them is morphology, which is the identification, analysis, and description of the structure of a given language’s morphemes and other linguistic units, such as root words, affixes, parts of speech, intonations and stresses, or implied context. Equally influential are syntax (sentence structure) and semantics, which is the relation between signifiers, like words, phrases, signs, and symbols, and what they stand for (semiotics). There is a distinct possibility that listeners do not have the ability to recognize phonemes prior to recognizing higher units such as words. After obtaining fundamental information about phonemic structure, listeners can typically compensate for inaudible or noise-masked phonemes using their knowledge of the spoken language.

Additional research has shown that naturally spoken words presented in a sentence or phrase are more accurately identified than the same words presented in isolation. Garnes and Bond (1976) have demonstrated that listeners tend to judge ambiguous words according to the meaning of the whole sentence or phrase [9] [10]. This is known as the phonemic restoration effect [8] and could help to explain why many researchers incorrectly identify EVP in their own native languages instead of languages native to the history of the subject area.

Although the primary language used within surveys 1A & 1B is English, it is our hypothesis that variations in language would present a significant obstacle in speech recognition, making audible identification of foreign languages contained in EVP recordings exceedingly difficult. Languages differ in their phonemic inventories. If two foreign-language sounds are assimilated to a single mother-tongue category the difference between them will be very difficult to discern. A classic example of this situation is the observation that Japanese learners of English will have problems with identifying or distinguishing English liquid consonants (/l/ and /r/) [12].

The process of perceiving speech begins at the level of the sound signal and the process of audition. After processing the initial auditory signal, speech sounds are further processed to extract acoustic cues and phonetic information. This speech information can then be used for higher-level language processes, such as word recognition [11]. One acoustic aspect of the speech signal may cue different linguistically relevant dimensions. For example, the duration of a vowel in English can indicate whether or not the vowel is stressed, or whether it is in a syllable closed by a voiced or a voiceless consonant, and in some cases (like American English /ɛ/ and /æ/) it can distinguish the identity of vowels [7]. Some experts even argue that duration can help in distinguishing what are traditionally called short and long vowels in English [6]. Interestingly, two of the most accurately identified words (“Hi” and “Dog”) also contain the longest vowel durations. This, combined with the common nature of the words, may have added to the ease of their identification.

Additionally, one linguistic unit can be cued by several acoustic properties. Alvin Liberman (1957) showed that the onset formant transitions of /d/ differ depending on the following vowel, but they are all interpreted as the phoneme /d/ by listeners [2]. This lack of onset formant discernment may contribute to higher-level misinterpretation of words, particularly in uncommon statements where vowel segments are occluded by noise or otherwise unidentifiable. This effect may be observed in several results contained in both surveys 1A and 1B, particularly in the general pattern of incorrect responses. While the responses were incorrect, they shared the same starting consonant and, in several cases, subsequent vowels that fall within the same or a neighboring formant frequency range.

Acoustic cues such as voice onset time (VOT) help differentiate between separate phonetic categories. VOT is a primary cue signaling the difference between voiced and voiceless oral occlusives (known as plosives – [t] [d] [k] [ɡ] [p] [b] [ʔ]). Other cues differentiate sounds that are produced at different places or manners of articulation. The speech system must also combine these cues to determine the category of a specific speech sound. This is often thought of in terms of abstract representations of phonemes. These representations can then be combined for use in word recognition and other language processes. It is not easy to identify which acoustic cues listeners are sensitive to when perceiving a particular speech sound. Therefore, identifying the specific reasons for one listener’s survey success over another’s, in terms of response to the same questions, may not be possible.

One of the basic problems in the study of speech is how to deal with noise in the speech signal. This issue is demonstrated by the difficulty computer speech recognition systems have with recognizing human speech. These programs can do well at recognizing speech when they have been trained on a specific speaker’s voice, and under quiet conditions. However, these systems often do poorly in more realistic listening situations where humans can understand speech without difficulty. When noise exceeds or matches the decibel levels of formant lines F1 or F2 (150 Hz – 5000 Hz) there will be a significant decrease in the listener’s ability to properly discern the phonemic vowel structure of the word being spoken. Under this condition the phonemic restoration effect becomes prevalent and gaps in the ambiguous word or phrase are filled via the listener’s anticipation process.

However, it should be noted that under certain conditions specific types of background noise may appear to improve a listener’s speech perception ability. For example, the success rate of “It’s Raining” in survey 1A may be partially attributed to the subtle sound of water in the background. Since no control files were created for the same words without the sound of water, we have no way to ascertain its effect on perception accuracy.

Although survey participants’ ages were not recorded during the experimentation process, we do feel that age may be a useful dimension to examine in future experiments. Presbycusis, a progressive, bilateral, symmetrical, age-related sensorineural hearing loss, is mostly noted at higher frequencies. Although human hearing starts to deteriorate as early as 18 years of age at frequencies above 15 or 16 kHz, the effects are not noticeable until later years, when the detection of high-pitched sounds becomes difficult and speech perception is affected, particularly of sibilants and fricatives ([s], [z], [ʃ], and [ʒ]) [4]. Since sibilants and fricatives were evenly distributed among both the accurately interpreted samples and the samples least identified, we do not feel that the presbycusis effects of participant ages had any significant bearing on the statistical outcome of this research.

At first glance, the solution to the problem of how we perceive speech seems deceptively simple. If one could identify stretches of the acoustic waveform that correspond to specific units of perception, then the path from sound to meaning would be clear. However, this correspondence or mapping has proven extremely difficult to find.[1] It is our position that recorded statements analyzed without the benefit of conversational context are consequently more difficult to understand. The relationship between statements and subject matter can ultimately increase comprehension by as much as 40%.


Many individuals have difficulty accepting the fallibility of their own interpretive methods, and rightly so. Our ability to interpret linguistic communication is the very foundation of our social and educational development. It is primarily responsible for our ability to live and reason in a civilized world. However, the difference between our perception of the world and our perception of EVP lies within the context and process of communication. When humans learn, there is often more than one informational dimension present to aid in understanding. For example, books have introductions, photos, and charts. Teachers have anecdotes, examples and demonstrations. Life has a chain of experiences, characters and objects that lead up to the moment of learning. All of these things help us to build a foundation in preparation for new information to be received and understood. Without them, even the simplest ideas would be difficult to understand. Language is no different. Random words and phrases without context are difficult to interpret regardless of the simplicity and clarity of the words being spoken. This difficulty increases with the obscurity of the word or phrase being communicated. To compensate for these failures the human mind first looks for connections to what it feels is most familiar. As a result, common words will be more easily identified (again, regardless of their simplicity). This process and its result are indicated in our comparison between the statistical results of surveys 1A & 1B and the secondary surveys.

When a familiar match is not found, the process of anticipation (mainly when perceiving phrases or sentences) begins. Although we are typically unaware of it, we use the anticipation process frequently in our everyday lives. For example, when we read the sentence “I need to walk my ____.”, most of the time our minds will automatically fill in the blank with the most common answer (which in this case is “dog”). This process is also demonstrated amongst people who appear to finish each other’s sentences.

We do this because the human speech perception process works from the top level down, meaning we first identify the words before identifying the letters and other detail elements that make them up. When context is not available to aid in the accuracy of our anticipation process, we may fill in the blanks incorrectly, thus misunderstanding the communication. It is here that we are subject to the effects of pareidolia, outside influence and/or situational/confirmation biases. This bias effect can fill in the blanks with words that jibe with our personal expectations, thoughts or beliefs, thus overriding the original content of the communication at hand.

When this effect occurs during EVP analysis, the result can at times create a false impression of intelligent interaction, since the listener will “fill in the blanks” with words that jibe with the investigation, the environment, outside input or their personal beliefs. This can lead to a contamination of objectivity. Once the listener verbally announces an erroneous interpretation of a recording, they run the danger of influencing the perception process of others through suggestive contamination. The human mind is limited in terms of quality discrimination concerning external data. It simply does not have the processing ability to be as discerning as it needs to be amongst all of its other processes. Therefore it will often incorrectly pair knowledge with sensory input based on basic phonetic and sentence structure, omitting many of the important details required for proper perception. This effect is often seen when misidentifying song lyrics. Because we are subject to this form of faulty perception, we can be easily influenced by suggestions that closely match our expectations and/or observations.

In summary, our research suggests that without a direct contextual or situational relation to the true nature of statements contained within EVP recordings, the listener’s perception of what is truly being said can be expected to be incorrect as much as 56% of the time due to the various influences outlined in this paper. Therefore, personal interpretation of EVP recording content (regardless of the listener’s confidence in their listening and perception ability) is not a reliable methodology when applied to factual research.

Michael J. Baker
Scientific Research Dept., New England Centre for the Advancement of Paranormal Science, Salem, MA


  1. Nygaard, L.C., Pisoni, D.B. (1995). “Speech Perception: New Directions in Research and Theory”. In J.L. Miller, P.D. Eimas. Handbook of Perception and Cognition: Speech, Language, and Communication. San Diego: Academic Press.
  2. Liberman, A.M., Mattingly, I.G. (1985). “The motor theory of speech perception revised”. Cognition 21 (1): 1–36. Retrieved 2007-07-19.
  3. Johnson, K. (2005). “Speaker Normalization in speech perception”. In Pisoni, D.B., Remez, R. The Handbook of Speech Perception. Oxford: Blackwell Publishers. Retrieved 2007-05-17.
  4. Huang, Qi; Tang, Jianguo (13 May 2010). “Age-related hearing loss or presbycusis”. European Archives of Oto-Rhino-Laryngology 267 (8): 1179–1191.
  5. Iverson, P., Kuhl, P.K., Akahane-Yamada, R., Diesh, E., Thokura, Y., Kettermann, A., Siebert, C. (2003). “A perceptual interference account of acquisition difficulties for non-native phonemes”. Cognition 89 (1): B47–B57.
  6. Halle, M., Mohanan, K.P. (1985). “Segmental phonology of modern English”. Linguistic Inquiry 16 (1): 57–116.
  7. Klatt, D.H. (1976). “Linguistic uses of segmental duration in English: Acoustic and perceptual evidence”. Journal of the Acoustical Society of America 59 (5): 1208–1221.
  8. Warren, R.M. (1970). “Restoration of missing speech sounds”. Science 167 (3917): 392–393.
  9. Garnes, S., Bond, Z.S. (1976). “Phonologica 1976”. Innsbruck. pp. 285–293.
  10. Jongman, A., Wang, Y., Kim, B.H. (December 2003). “Contributions of semantic and facial information to perception of nonsibilant fricatives”. J. Speech Lang. Hear. Res. 46 (6): 1367–77.
  11. Schacter, D., Gilbert, D., Wegner, D. (2011). “Sensation and Perception”. In Charles Linsmeiser. Psychology. Worth Publishers. pp. 158–159.
  12. Iverson, P., Kuhl, P.K., Akahane-Yamada, R., Diesh, E., Thokura, Y., Kettermann, A., Siebert, C. (2003). “A perceptual interference account of acquisition difficulties for non-native phonemes”. Cognition 89 (1): B47–B57.
  13. Kamide, Y., University of Dundee (2008). “Anticipatory Processes in Sentence Processing”. Language and Linguistics Compass, 24.
  14. Nooteboom, S.G., Cohen, A. (1975). “Anticipation in speech production and its implications for perception”. In A. Cohen & S.G. Nooteboom (Eds.), Structure and process in speech perception. Berlin: Springer Verlag.

