JavaScript Speech Recognition Example (Speech to Text)

With the Web Speech API, we can recognize speech using JavaScript. It is easy to recognize speech in the browser with JavaScript and then use the resulting text as user input. We have already covered how to convert text to speech in JavaScript.

However, support for this API is largely limited to the Chrome browser, so if you are viewing this example in another browser, the live example below might not work.

Javascript speech recognition - speech to text

This tutorial will walk through a basic speech-to-text example. We will ask the user to speak, use the SpeechRecognition object to convert the speech into text, and then display the text on the screen.

The Web Speech API of Javascript can be used for multiple other use cases. We can provide a list of rules for words or sentences as grammar using the SpeechGrammarList object, which will be used to recognize and validate user input from speech.

For example, consider a webpage showing a quiz with a question and 4 available options, where the user has to select the correct option. Here, we can set the grammar for speech recognition to contain only the options for the question, so whatever the user speaks is recognized only if it matches one of the 4 options.

We can use grammar to define rules for speech recognition, configuring what our app understands and what it doesn't.

JavaScript Speech to Text

In the code example below, we will use the SpeechRecognition object. We haven't used too many properties and are relying on the default values. We have a simple HTML webpage in the example, where we have a button to initiate the speech recognition.

The main JavaScript code that listens to what the user speaks and converts it to text looks like this:
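As a minimal sketch based on the description that follows (the button, status, and output element IDs here are placeholders, not from the original article):

```js
const SpeechRecognition = window.SpeechRecognition || window.webkitSpeechRecognition;
const recognition = new SpeechRecognition();

// Start listening when the button is clicked.
document.querySelector("#start-btn").addEventListener("click", () => {
  recognition.start();
});

// Tell the user that recognition has started.
recognition.onstart = () => {
  document.querySelector("#status").textContent =
    "Listening... please speak into the microphone.";
};

// Show the transcript and confidence once a result arrives.
recognition.onresult = (event) => {
  const transcript = event.results[0][0].transcript;
  const confidence = event.results[0][0].confidence;
  document.querySelector("#output").textContent =
    `You said: "${transcript}" (confidence: ${confidence.toFixed(2)})`;
};

// Stop recognizing once the user stops speaking.
recognition.onspeechend = () => {
  recognition.stop();
};
```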

In the above code, we have used:

The recognition.start() method is used to start speech recognition.

Once we begin speech recognition, the onstart event handler can be used to inform the user that speech recognition has started and that they should speak into the microphone.

When the user is done speaking, the onresult event handler will have the result. The SpeechRecognitionEvent results property returns a SpeechRecognitionResultList object, which contains SpeechRecognitionResult objects. It has a getter, so it can be accessed like an array: the first [0] returns the SpeechRecognitionResult at position 0. Each SpeechRecognitionResult object contains SpeechRecognitionAlternative objects that hold the individual results. These also have getters, so they can be accessed like arrays: the second [0] returns the SpeechRecognitionAlternative at position 0. We then read the transcript property of that SpeechRecognitionAlternative object.

The same is done for the confidence property, to get the accuracy of the result as estimated by the API.

There are many event handlers for the events surrounding the speech recognition process. One such event is onspeechend, which we use in our code to call the stop() method of the SpeechRecognition object and end the recognition process.

Now let's see the running code:

When you run the code, the browser will ask for permission to use your microphone, so click Allow and then say anything to see the script in action.

Conclusion:

So in this tutorial we learned how to use JavaScript to write a small application that converts speech into text and displays the text output on screen. We also made the whole process more interactive by using the various event handlers available in the SpeechRecognition interface. In the future I will try to cover some simple web application ideas that use this feature of JavaScript, to help you understand where it can be applied.

If you face any issue running the above script, post it in the comment section below. Remember, only Chrome-based browsers support it.


How to convert speech into text using JavaScript?

In this article, we will learn to convert speech into text using HTML and JavaScript. 

Approach: We add a contenteditable "div", which makes the HTML element editable.

We use the  SpeechRecognition  object to convert the speech into text and then display the text on the screen.

We also use the WebKit-prefixed version (webkitSpeechRecognition) so that speech recognition works in Google Chrome and Apple Safari.

To receive interim results, interimResults should be true; its default value is false. So set interimResults = true.

Use the appendChild() method to append a node as the last child of another node.

Add an event listener for the result event; inside it, the map() method is used to create a new array with the results of calling a function for every array element.

Note: This method does not change the original array. 

Use the join() method to return the array as a string.

 

Final Code:
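The original listing isn't reproduced here, but a rough sketch following the steps above might look like this (the element ID and page structure are assumptions):

```js
const SpeechRecognition = window.SpeechRecognition || window.webkitSpeechRecognition;
const recognition = new SpeechRecognition();
recognition.interimResults = true;

const editableDiv = document.querySelector("#converted-text"); // the contenteditable div
const p = document.createElement("p");
editableDiv.appendChild(p); // appendChild() adds the paragraph as the last child

recognition.addEventListener("result", (event) => {
  // map() builds a new array of transcripts; join() turns it into a single string.
  const transcript = Array.from(event.results)
    .map((result) => result[0].transcript)
    .join("");
  p.textContent = transcript;
});

recognition.start();
```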

                 

Output: 

If the user says "Hello World" after running the file, the page shows the following on the screen.


Using the Web Speech API

Speech recognition.

Speech recognition involves receiving speech through a device's microphone, which is then checked by a speech recognition service against a list of grammar (basically, the vocabulary you want to have recognized in a particular app.) When a word or phrase is successfully recognized, it is returned as a result (or list of results) as a text string, and further actions can be initiated as a result.

The Web Speech API has a main controller interface for this — SpeechRecognition — plus a number of closely-related interfaces for representing grammar, results, etc. Generally, the default speech recognition system available on the device will be used for the speech recognition — most modern OSes have a speech recognition system for issuing voice commands. Think about Dictation on macOS, Siri on iOS, Cortana on Windows 10, Android Speech, etc.

Note: On some browsers, such as Chrome, using Speech Recognition on a web page involves a server-based recognition engine. Your audio is sent to a web service for recognition processing, so it won't work offline.

To show simple usage of Web speech recognition, we've written a demo called Speech color changer . When the screen is tapped/clicked, you can say an HTML color keyword, and the app's background color will change to that color.

The UI of an app titled Speech Color changer. It invites the user to tap the screen and say a color, and then it turns the background of the app that color. In this case it has turned the background red.

To run the demo, navigate to the live demo URL in a supporting mobile browser (such as Chrome).

HTML and CSS

The HTML and CSS for the app is really trivial. We have a title, instructions paragraph, and a div into which we output diagnostic messages.

The CSS provides a very simple responsive styling so that it looks OK across devices.

Let's look at the JavaScript in a bit more detail.

Prefixed properties

Browsers currently support speech recognition with prefixed properties. Therefore at the start of our code we include these lines to allow for both prefixed properties and unprefixed versions that may be supported in future:
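A sketch of that prefixed/unprefixed aliasing:

```js
const SpeechRecognition = window.SpeechRecognition || window.webkitSpeechRecognition;
const SpeechGrammarList = window.SpeechGrammarList || window.webkitSpeechGrammarList;
const SpeechRecognitionEvent = window.SpeechRecognitionEvent || window.webkitSpeechRecognitionEvent;
```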

The grammar

The next part of our code defines the grammar we want our app to recognize. The following variable is defined to hold our grammar:
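A sketch of that variable (the color list here is shortened for illustration):

```js
const colors = ["aqua", "azure", "beige", "bisque", "black", "blue", "brown", "coral" /* … */];
const grammar = `#JSGF V1.0; grammar colors; public <color> = ${colors.join(" | ")};`;
```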

The grammar format used is JSpeech Grammar Format ( JSGF ) — you can find a lot more about it at the previous link to its spec. However, for now let's just run through it quickly:

  • The lines are separated by semicolons, just like in JavaScript.
  • The first line — #JSGF V1.0; — states the format and version used. This always needs to be included first.
  • The second line indicates a type of term that we want to recognize. public declares that it is a public rule, the string in angle brackets defines the recognized name for this term ( color ), and the list of items that follow the equals sign are the alternative values that will be recognized and accepted as appropriate values for the term. Note how each is separated by a pipe character.
  • You can have as many terms defined as you want on separate lines following the above structure, and include fairly complex grammar definitions. For this basic demo, we are just keeping things simple.

Plugging the grammar into our speech recognition

The next thing to do is define a speech recognition instance to control the recognition for our application. This is done using the SpeechRecognition() constructor. We also create a new speech grammar list to contain our grammar, using the SpeechGrammarList() constructor.

We add our grammar to the list using the SpeechGrammarList.addFromString() method. This accepts as parameters the string we want to add, plus optionally a weight value that specifies the importance of this grammar in relation to other grammars available in the list (can be from 0 to 1 inclusive.) The added grammar is available in the list as a SpeechGrammar object instance.

We then add the SpeechGrammarList to the speech recognition instance by setting it to the value of the SpeechRecognition.grammars property. We also set a few other properties of the recognition instance before we move on (a combined sketch follows the list below):

  • SpeechRecognition.continuous : Controls whether continuous results are captured ( true ), or just a single result each time recognition is started ( false ).
  • SpeechRecognition.lang : Sets the language of the recognition. Setting this is good practice, and therefore recommended.
  • SpeechRecognition.interimResults : Defines whether the speech recognition system should return interim results, or just final results. Final results are good enough for this simple demo.
  • SpeechRecognition.maxAlternatives : Sets the number of alternative potential matches that should be returned per result. This can sometimes be useful, say if a result is not completely clear and you want to display a list of alternatives for the user to choose the correct one from. But it is not needed for this simple demo, so we are just specifying one (which is actually the default anyway.)
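Putting those steps together, a combined sketch might look like this:

```js
const recognition = new SpeechRecognition();
const speechRecognitionList = new SpeechGrammarList();

// Add the grammar defined earlier with a weight of 1.
speechRecognitionList.addFromString(grammar, 1);
recognition.grammars = speechRecognitionList;

recognition.continuous = false;
recognition.lang = "en-US";
recognition.interimResults = false;
recognition.maxAlternatives = 1;
```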

Starting the speech recognition

After grabbing references to the output <div> and the HTML element (so we can output diagnostic messages and update the app background color later on), we implement an onclick handler so that when the screen is tapped/clicked, the speech recognition service will start. This is achieved by calling SpeechRecognition.start() . The forEach() method is used to output colored indicators showing what colors to try saying.
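A sketch of that setup (the .output and .hints selectors are assumptions about the demo's markup):

```js
const diagnostic = document.querySelector(".output");
const bg = document.querySelector("html");
const hints = document.querySelector(".hints");

// Show colored indicators for the colors the user can try saying.
let colorHTML = "";
colors.forEach((color) => {
  colorHTML += `<span style="background-color:${color};">${color}</span> `;
});
hints.innerHTML = `Tap or click, then say a color: ${colorHTML}`;

// Start recognition when the screen is tapped/clicked.
document.body.onclick = () => {
  recognition.start();
  console.log("Ready to receive a color command.");
};
```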

Receiving and handling results

Once the speech recognition is started, there are many event handlers that can be used to retrieve results, and other pieces of surrounding information (see the SpeechRecognition events .) The most common one you'll probably use is the result event, which is fired once a successful result is received:
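A sketch of that handler:

```js
recognition.onresult = (event) => {
  const color = event.results[0][0].transcript;
  diagnostic.textContent = `Result received: ${color}.`;
  bg.style.backgroundColor = color;
  console.log(`Confidence: ${event.results[0][0].confidence}`);
};
```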

The second line here is a bit complex-looking, so let's explain it step by step. The SpeechRecognitionEvent.results property returns a SpeechRecognitionResultList object containing SpeechRecognitionResult objects. It has a getter so it can be accessed like an array — so the first [0] returns the SpeechRecognitionResult at position 0. Each SpeechRecognitionResult object contains SpeechRecognitionAlternative objects that contain individual recognized words. These also have getters so they can be accessed like arrays — the second [0] therefore returns the SpeechRecognitionAlternative at position 0. We then return its transcript property to get a string containing the individual recognized result as a string, set the background color to that color, and report the color recognized as a diagnostic message in the UI.

We also use the speechend event to stop the speech recognition service from running (using SpeechRecognition.stop() ) once a single word has been recognized and it has finished being spoken:
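For example:

```js
recognition.onspeechend = () => {
  recognition.stop();
};
```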

Handling errors and unrecognized speech

The last two handlers are there to handle cases where speech was recognized that wasn't in the defined grammar, or an error occurred. The nomatch event seems to be supposed to handle the first case mentioned, although note that at the moment it doesn't seem to fire correctly; it just returns whatever was recognized anyway:
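A sketch of that handler:

```js
recognition.onnomatch = (event) => {
  diagnostic.textContent = "I didn't recognize that color.";
};
```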

The error event handles cases where there is an actual error with the recognition — the SpeechRecognitionErrorEvent.error property contains the actual error returned:
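For example:

```js
recognition.onerror = (event) => {
  diagnostic.textContent = `Error occurred in recognition: ${event.error}`;
};
```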

Speech synthesis

Speech synthesis (aka text-to-speech, or TTS) involves taking text contained within an app, synthesizing it into speech, and playing it out of a device's speaker or audio output connection.

The Web Speech API has a main controller interface for this — SpeechSynthesis — plus a number of closely-related interfaces for representing text to be synthesized (known as utterances), voices to be used for the utterance, etc. Again, most OSes have some kind of speech synthesis system, which will be used by the API for this task as available.

To show simple usage of Web speech synthesis, we've provided a demo called Speak easy synthesis . This includes a set of form controls for entering text to be synthesized, and setting the pitch, rate, and voice to use when the text is uttered. After you have entered your text, you can press Enter / Return to hear it spoken.

UI of an app called speak easy synthesis. It has an input field in which to input text to be synthesized, slider controls to change the rate and pitch of the speech, and a drop down menu to choose between different voices.

To run the demo, navigate to the live demo URL in a supporting mobile browser.

The HTML and CSS are again pretty trivial, containing a title, some instructions for use, and a form with some simple controls. The <select> element is initially empty, but is populated with <option> s via JavaScript (see later on.)

Let's investigate the JavaScript that powers this app.

Setting variables

First of all, we capture references to all the DOM elements involved in the UI, but more interestingly, we capture a reference to Window.speechSynthesis . This is API's entry point — it returns an instance of SpeechSynthesis , the controller interface for web speech synthesis.
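A sketch of those references (the selectors are assumptions about the demo's markup):

```js
const synth = window.speechSynthesis;

const inputForm = document.querySelector("form");
const inputTxt = document.querySelector(".txt");
const voiceSelect = document.querySelector("select");
const pitch = document.querySelector("#pitch");
const pitchValue = document.querySelector(".pitch-value");
const rate = document.querySelector("#rate");
const rateValue = document.querySelector(".rate-value");

let voices = [];
```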

Populating the select element

To populate the <select> element with the different voice options the device has available, we've written a populateVoiceList() function. We first invoke SpeechSynthesis.getVoices() , which returns a list of all the available voices, represented by SpeechSynthesisVoice objects. We then loop through this list — for each voice we create an <option> element, set its text content to display the name of the voice (grabbed from SpeechSynthesisVoice.name ), the language of the voice (grabbed from SpeechSynthesisVoice.lang ), and -- DEFAULT if the voice is the default voice for the synthesis engine (checked by seeing if SpeechSynthesisVoice.default returns true .)

We also create data- attributes for each option, containing the name and language of the associated voice, so we can grab them easily later on, and then append the options as children of the select.
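A sketch of that function:

```js
function populateVoiceList() {
  voices = synth.getVoices();

  for (const voice of voices) {
    const option = document.createElement("option");
    option.textContent = `${voice.name} (${voice.lang})`;
    if (voice.default) {
      option.textContent += " -- DEFAULT";
    }
    option.setAttribute("data-lang", voice.lang);
    option.setAttribute("data-name", voice.name);
    voiceSelect.appendChild(option);
  }
}
```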

Older browsers don't support the voiceschanged event and just return a list of voices when SpeechSynthesis.getVoices() is called, while on others, such as Chrome, you have to wait for the event to fire before populating the list. To allow for both cases, we run the function as shown below:
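Roughly:

```js
populateVoiceList();
if (speechSynthesis.onvoiceschanged !== undefined) {
  speechSynthesis.onvoiceschanged = populateVoiceList;
}
```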

Speaking the entered text

Next, we create an event handler to start speaking the text entered into the text field. We are using an onsubmit handler on the form so that the action happens when Enter / Return is pressed. We first create a new SpeechSynthesisUtterance() instance using its constructor — this is passed the text input's value as a parameter.

Next, we need to figure out which voice to use. We use the HTMLSelectElement selectedOptions property to return the currently selected <option> element. We then use this element's data-name attribute, finding the SpeechSynthesisVoice object whose name matches this attribute's value. We set the matching voice object to be the value of the SpeechSynthesisUtterance.voice property.

Finally, we set the SpeechSynthesisUtterance.pitch and SpeechSynthesisUtterance.rate to the values of the relevant range form elements. Then, with all necessary preparations made, we start the utterance being spoken by invoking SpeechSynthesis.speak() , passing it the SpeechSynthesisUtterance instance as a parameter.
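A combined sketch of that handler (the pause handling and the blur() call described next are omitted here):

```js
inputForm.onsubmit = (event) => {
  event.preventDefault();

  const utterThis = new SpeechSynthesisUtterance(inputTxt.value);

  // Find the SpeechSynthesisVoice whose name matches the selected option's data-name.
  const selectedName = voiceSelect.selectedOptions[0].getAttribute("data-name");
  for (const voice of voices) {
    if (voice.name === selectedName) {
      utterThis.voice = voice;
    }
  }

  utterThis.pitch = pitch.value;
  utterThis.rate = rate.value;
  synth.speak(utterThis);
};
```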

In the final part of the handler, we include a pause event to demonstrate how SpeechSynthesisEvent can be put to good use. When SpeechSynthesis.pause() is invoked, this returns a message reporting the character number and name that the speech was paused at.

Finally, we call blur() on the text input. This is mainly to hide the keyboard on Firefox OS.

Updating the displayed pitch and rate values

The last part of the code updates the pitch / rate values displayed in the UI, each time the slider positions are moved.
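For example:

```js
pitch.onchange = () => {
  pitchValue.textContent = pitch.value;
};

rate.onchange = () => {
  rateValue.textContent = rate.value;
};
```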


Building a Speech to Text App with JavaScript

Introduction.

This article will cover how to build a speech-to-text application in JavaScript using the Web Speech API's SpeechRecognition interface. Speech to text is the conversion of spoken words into text. You might have been looking for ways to convert spoken words to text, or wondered whether it's possible and how to do it. This article will answer those questions.

Getting Started

Building the user interface.

Create a file named speech.html and paste the following code inside:

In the code above, we designed the user interface for our speech-to-text application using HTML and CSS. However, for simplicity, we are going to use Bootstrap to ease the designing process.

We also created a button that will trigger the speech-to-text converter code, and the result is shown in the textarea we created above.

You should get a response similar to the image below when you open the speech.html file in your browser:

Writing the JavaScript Code

Create a file named script.js and paste the following code inside:
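The original listing isn't reproduced here; a minimal sketch of the setup described below (element IDs and jQuery selectors are assumptions) looks like this:

```js
const SpeechRecognition = window.SpeechRecognition || window.webkitSpeechRecognition;
const recognition = new SpeechRecognition();

// References to the textarea and the instruction box defined in speech.html.
const textbox = $("#textbox");
const instructions = $("#instructions");

// Keeps track of everything transcribed so far.
let content = "";

// Keep listening even when the user pauses between phrases.
recognition.continuous = true;

// Start recognition when the Start button is clicked.
$("#start-btn").on("click", () => {
  recognition.start();
});
```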

In the code above, we invoked the Web Speech Recognition API and initialized an instance stored in the recognition variable.

After this, we made references to the #textbox and #instructions elements defined in the HTML using jQuery, so we can control them from our code.

We also created a content variable that keeps track of text the application has converted and displayed in the textarea from the HTML file. We are initializing it to an empty string because we have not converted anything yet.

We then set the continuous property of the recognition object to true, making the API listen continuously for input from the user's microphone.

We created an event handler triggered whenever the user clicks on the Start button. When this happens, recognition starts and listens for input from the user.

When you press the button, your browser will request permission to use your microphone, as shown in the image below.


We also added a couple of event handlers to the recognition object to bring our application to life. They are onstart, onspeechend, onerror, and onresult.

The onstart event handler is triggered when the recognition API starts and has microphone access. Here, we programmed our application to inform the user that voice recognition is on and is converting speech to text.
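For example:

```js
recognition.onstart = () => {
  instructions.text("Voice recognition is on. Speak into the microphone.");
};
```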


Next, we will write code for the onresult event handler. This event is triggered when the recognition API has successfully converted speech from the user’s microphone to text, and the data is made available via the event.results variable.

In this function, we will fetch the transcript of the speech given to us by the event.results variable, then update our previous content variable and textarea with the new results.
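A sketch of that handler:

```js
recognition.onresult = (event) => {
  const current = event.resultIndex;
  const transcript = event.results[current][0].transcript;

  content += transcript;
  textbox.val(content);
};
```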

Now, the application is complete. If you click the Start button, you will see that it will automatically convert whatever you speak into text and fill the transcribed text inside the textbox.


We also created the onerror event handler triggered when an error occurs while transcribing the speech. If any error occurs during this process, our application will inform the user via the instruction box.
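For example:

```js
recognition.onerror = (event) => {
  instructions.text("There was a problem transcribing your speech. Please try again.");
};
```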


We also created the onspeechend event handler triggered when there is no input from the microphone, and the application is in an idle state. When this happens, our application will inform the user via the instruction box.
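For example:

```js
recognition.onspeechend = () => {
  instructions.text("No speech detected. Recognition is idle.");
};
```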


In this article, we built a speech-to-text application with JavaScript. We saw how easy it is to create an interactive user interface using HTML, CSS, and JavaScript, how to import our web assets from CDNs, and how to make our application convert speech to text successfully.

How to convert voice to text with javascript (webkitSpeechRecognition API) easily

Carlos Delgado · February 23, 2016

Learn how to use the webkitSpeechRecognition API to convert voice to text in the browser.

About the webkitSpeechRecognition API

The Web Speech API, introduced at the end of 2012, allows web developers to provide speech input and text-to-speech output features in a web browser. Typically, these features aren’t available when using standard speech recognition or screen reader software. This API takes care of the privacy of the users. Before allowing the website to access the voice via microphone, the user must explicitly grant permission.

Some important points you need to know :

  • As of this writing (23.02.2016), it is only available in Google Chrome.
  • Local files (file:// protocol) are not allowed; the file needs to be hosted on a server (or localhost).

Basic example

The following code provides the most basic support for retrieving what the user says; you can use interim_transcript and final_transcript to show the user the recognized text.
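A minimal sketch along those lines (the variable names follow the text above):

```js
const recognition = new webkitSpeechRecognition();
recognition.continuous = true;
recognition.interimResults = true;

let final_transcript = "";

recognition.onresult = (event) => {
  let interim_transcript = "";
  for (let i = event.resultIndex; i < event.results.length; i++) {
    if (event.results[i].isFinal) {
      final_transcript += event.results[i][0].transcript;
    } else {
      interim_transcript += event.results[i][0].transcript;
    }
  }
  console.log("Interim:", interim_transcript);
  console.log("Final:", final_transcript);
};

recognition.start();
```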

Google's repository on GitHub has a very complete example (with many language codes, error prevention, etc.); you can download the demo from the repository here.

Using a library

Artyom.js is a robust wrapper library for the webkitSpeechRecognition API; it allows you to do awesome tricks like voice commands, voice prompts, speech synthesis and many more features. In this case we are interested in the artyom.newDictation function, which wraps all the previous code in something simpler. First you need to include the library in your project, so your HTML file should load the artyom.js script.

Once you have linked the artyom library in your document, your JavaScript will look something like this:
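This is only a sketch of the artyom.newDictation usage described below; the option names other than onResult are assumptions:

```js
const dictation = artyom.newDictation({
  continuous: true, // keep listening until stopped
  onResult: function (text) {
    // text holds the recognized speech
    console.log("Recognized: " + text);
  },
});

// Start and, later, stop the dictation.
dictation.start();
// dictation.stop();
```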

You only have to handle the initialization, and then the magic happens in the onResult property of the settings object. Although artyom makes things a lot easier, think about whether you really need it: if you're just beginning with this topic, it's recommended to use the plain code so you understand how this API works; if you're still interested, you can use artyom later.

The potential of this API is really incredible; however, it's a shame that only Google Chrome supports it. You can improve all the previous code, for example by detecting in which browsers webkitSpeechRecognition can be initialized.



Converting from Speech to Text with JavaScript

In this tutorial we are going to experiment with the Web Speech API . It's a very powerful browser interface that allows you to record human speech and convert it into text. We will also use it to do the opposite - reading out strings in a human-like voice.

Let's jump right in!

To showcase the ability of the API we are going to build a simple voice-powered note app. It does 3 things:

  • Takes notes by using voice-to-text or traditional keyboard input.
  • Saves notes to localStorage.
  • Shows all notes and gives the option to listen to them via Speech Synthesis.


We won't be using any fancy dependencies, just good old jQuery for easier DOM operations and Shoelace for CSS styles. We are going to include them directly via CDN, no need to get NPM involved for such a tiny project.

The HTML and CSS are pretty standard so we are going to skip them and go straight to the JavaScript. To view the full source code go to the Download button near the top of the page.

Speech to Text

The Web Speech API is actually separated into two totally independent interfaces. We have SpeechRecognition for understanding human voice and turning it into text (Speech -> Text) and SpeechSynthesis for reading strings out loud in a computer generated voice (Text -> Speech). We'll start with the former.

The Speech Recognition API is surprisingly accurate for a free browser feature. It recognized correctly almost all of my speaking and knew which words go together to form phrases that make sense. It also allows you to dictate special characters like full stops, question marks, and new lines.

The first thing we need to do is check if the user has access to the API and show an appropriate error message. Unfortunately, the speech-to-text API is supported only in Chrome and Firefox (with a flag), so a lot of people will probably see that message.
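A sketch of that check (the CSS class names are assumptions):

```js
try {
  const SpeechRecognition = window.SpeechRecognition || window.webkitSpeechRecognition;
  var recognition = new SpeechRecognition();
} catch (e) {
  console.error(e);
  $(".no-browser-support").show();
  $(".app").hide();
}
```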

The recognition variable will give us access to all the API's methods and properties. There are various options available but we will only set recognition.continuous to true. This will enable users to speak with longer pauses between words and phrases.

Before we can use the voice recognition, we also have to set up a couple of event handlers. Most of them simply listen for changes in the recognition status:
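For example (the instructions element is an assumption about the app's markup):

```js
recognition.onstart = () => {
  instructions.text("Voice recognition activated. Try speaking into the microphone.");
};

recognition.onspeechend = () => {
  instructions.text("You were quiet for a while, so voice recognition turned itself off.");
};

recognition.onerror = (event) => {
  if (event.error === "no-speech") {
    instructions.text("No speech was detected. Try again.");
  }
};
```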

There is, however, a special onresult event that is very crucial. It is executed every time the user speaks a word or several words in quick succession, giving us access to a text transcription of what was said.

When we capture something with the onresult handler we save it in a global variable and display it in a textarea:
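Roughly (noteContent and noteTextarea are the global variable and textarea reference mentioned above):

```js
recognition.onresult = (event) => {
  const current = event.resultIndex;
  const transcript = event.results[current][0].transcript;

  noteContent += transcript;
  noteTextarea.val(noteContent);
};
```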

The above code is slightly simplified. There is a very weird bug on Android devices that causes everything to be repeated twice. There is no official solution yet but we managed to solve the problem without any obvious side effects. With that bug in mind the code looks like this:
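A sketch of the workaround:

```js
recognition.onresult = (event) => {
  const current = event.resultIndex;
  const transcript = event.results[current][0].transcript;

  // On some Android devices the last phrase is reported twice; skip the duplicate.
  const mobileRepeatBug =
    current === 1 && transcript === event.results[0][0].transcript;

  if (!mobileRepeatBug) {
    noteContent += transcript;
    noteTextarea.val(noteContent);
  }
};
```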

Once we have everything set up we can start using the browser's voice recognition feature. To start it simply call the start() method:
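For example (the button ID is an assumption):

```js
$("#start-record-btn").on("click", () => {
  if (noteContent.length) {
    noteContent += " "; // add a space before appending new phrases
  }
  recognition.start();
});
```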

This will prompt users to give permission. If such is granted the device's microphone will be activated.

Most APIs that require user permission don't work on non-secure hosts. Make sure you are serving your Web Speech apps over HTTPS.

The browser will listen for a while and every recognized phrase or word will be transcribed. The API will stop listening automatically after a couple seconds of silence or when manually stopped.

With this, the speech-to-text portion of our app is complete! Now, let's do the opposite!

Text to Speech

Speech synthesis is actually very easy. The API is accessible through the speechSynthesis object and there are a couple of methods for playing, pausing, and other audio related stuff. It also has a couple of cool options that change the pitch, rate, and even the voice of the reader.

All we will actually need for our demo is the speak() method. It expects one argument, an instance of the beautifully named SpeechSynthesisUtterance class.

Here is the entire code needed to read out a string.
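A sketch of that function:

```js
function readOutLoud(message) {
  const speech = new SpeechSynthesisUtterance();

  // Set the text and basic voice attributes.
  speech.text = message;
  speech.volume = 1;
  speech.rate = 1;
  speech.pitch = 1;

  window.speechSynthesis.speak(speech);
}
```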

When this function is called, a robot voice will read out the given string, doing its best human impression.

In an era where voice assistants are more popular than ever, an API like this gives you a quick shortcut to building bots that understand and speak human language.

Adding voice control to your apps can also be a great form of accessibility enhancement. Users with visual impairment can benefit from both speech-to-text and text-to-speech user interfaces.

The speech synthesis and speech recognition APIs work pretty well and handle different languages and accents with ease. Sadly, they have limited browser support for now which narrows their usage in production. If you need a more reliable form of speech recognition, take a look at these third-party APIs:

  • Google Cloud Speech API
  • Bing Speech API
  • CMUSphinx and its JavaScript version Pocketsphinx (both open-source).
  • API.AI - Free Google API powered by Machine Learning


Build a Speech-to-text Web App with Whisper, React and Node


In this article, we'll build a speech-to-text application using OpenAI's Whisper, along with React, Node.js, and FFmpeg. The app will take audio input from the user, transcribe it into text using OpenAI's Whisper API, and display the resulting text. Whisper gives the most accurate speech-to-text transcription I've used, even for a non-native English speaker.

Introducing Whisper

OpenAI explains that Whisper is an automatic speech recognition (ASR) system trained on 680,000 hours of multilingual and multitask supervised data collected from the Web.

Text is easier to search and store than audio. However, transcribing audio to text can be quite laborious. ASRs like Whisper can detect speech and transcribe the audio to text with a high level of accuracy and very quickly, making it a particularly useful tool.

This article is aimed at developers who are familiar with JavaScript and have a basic understanding of React and Express.

If you want to build along, you’ll need an API key. You can obtain one by signing up for an account on the OpenAI platform. Once you have an API key, make sure to keep it secure and not share it publicly.

We’ll be building the frontend of this app with Create React App (CRA). All we’ll be doing in the frontend is uploading files, picking time boundaries, making network requests and managing a few states. I chose CRA for simplicity. Feel free to use any frontend library you prefer or even plain old JS. The code should be mostly transferable.

For the backend, we’ll be using Node.js and Express, just so we can stick with a full JS stack for this app. You can use Fastify or any other alternative in place of Express and you should still be able to follow along.

Note: in order to keep this article focussed on the subject, long blocks of code will be linked to, so we can focus on the real tasks at hand.

We start by creating a new folder that will contain both the frontend and backend for the project for organizational purposes. Feel free to choose any other structure you prefer:

Next, we initialize a new React application using create-react-app :

Navigate to the new frontend folder and install axios to make network requests and react-dropzone for file upload with the code below:

Now, let’s switch back into the main folder and create the backend folder:

Next, we initialize a new Node application in our backend directory, while also installing the required libraries:

In the code above, we’ve installed the following libraries:

  • dotenv : necessary to keep our OpenAI API key away from the source code.
  • cors : to enable cross-origin requests.
  • multer : middleware for uploading our audio files. It adds a .file or .files object to the request object, which we’ll then access in our route handlers.
  • form-data : to programmatically create and submit forms with file uploads and fields to a server.
  • axios : to make network requests to the Whisper endpoint.

Also, since we’ll be using FFmpeg for audio trimming, we have these libraries:

  • fluent-ffmpeg : this provides a fluent API to work with the FFmpeg tool, which we’ll use for audio trimming.
  • ffmetadata : this is used for reading and writing metadata in media files. We need it to retrieve the audio duration.
  • ffmpeg-static : this provides static FFmpeg binaries for different platforms, and simplifies deploying FFmpeg.

Our entry file for the Node.js app will be index.js . Create the file inside the backend folder and open it in a code editor. Let’s wire up a basic Express server:
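A minimal sketch of that server (the port number and route contents are assumptions):

```js
// backend/index.js
require("dotenv").config();
const express = require("express");
const cors = require("cors");

const app = express();
app.use(cors());

app.get("/", (req, res) => {
  res.send("Welcome to the transcription backend!");
});

const port = process.env.PORT || 3001;
app.listen(port, () => {
  console.log(`Server listening on port ${port}`);
});
```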

Update package.json in the backend folder to include start and dev scripts:

The above code simply registers a simple GET route. When we run npm run dev and go to localhost:3001 or whatever our port is, we should see the welcome text.

Now it’s time to add the secret sauce! In this section, we’ll:

  • accept a file upload on a POST route
  • convert the file to a readable stream
  • very importantly, send the file to Whisper for transcription
  • send the response back as JSON

Let’s now create a .env file at the root of the backend folder to store our API Key, and remember to add it to gitignore :

First, let’s import some of the libraries we need to update file uploads, network requests and streaming:

Next, we’ll create a simple utility function to convert the file buffer into a readable stream that we’ll send to Whisper:

We’ll create a new route, /api/transcribe , and use axios to make a request to OpenAI.

First, import axios at the top of the index.js file: const axios = require('axios');

Then, create the new route, like so:
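A sketch of that route, assuming the API key is stored in an OPENAI_API_KEY variable in .env:

```js
app.post("/api/transcribe", upload.single("file"), async (req, res) => {
  try {
    // Convert the uploaded file buffer into a readable stream.
    const audioStream = bufferToStream(req.file.buffer);

    // Build the multipart request for the Whisper transcription endpoint.
    const formData = new FormData();
    formData.append("file", audioStream, {
      filename: req.file.originalname,
      contentType: req.file.mimetype,
    });
    formData.append("model", "whisper-1");

    const response = await axios.post(
      "https://api.openai.com/v1/audio/transcriptions",
      formData,
      {
        headers: {
          ...formData.getHeaders(),
          Authorization: `Bearer ${process.env.OPENAI_API_KEY}`,
        },
      }
    );

    // Whisper returns the transcription in response.data.text.
    res.json({ transcription: response.data.text });
  } catch (error) {
    console.error(error);
    res.status(500).json({ error: "Transcription failed" });
  }
});
```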

In the code above, we use the utility function bufferToStream to convert the audio file buffer into a readable stream, then send it over a network request to Whisper and await the response, which is then sent back as a JSON response.

You can check the docs for more on the request and response for Whisper.

We’ll add additional functionality below to allow the user to transcribe a part of the audio. To do this, our API endpoint will accept startTime and endTime , after which we’ll trim the audio with ffmpeg .

To install FFmpeg for Windows, follow the simple steps below:

  • Visit the FFmpeg official website’s download page here .
  • Under the Windows icon there are several links. Choose the link that says “Windows Builds”, by gyan.dev.
  • Download the build that corresponds to our system (32 or 64 bit). Make sure to download the “static” version to get all the libraries included.
  • Extract the downloaded ZIP file. We can place the extracted folder wherever we prefer.
  • To use FFmpeg from the command line without having to navigate to its folder, add the FFmpeg bin folder to the system PATH.

If we’re on macOS, we can install FFmpeg with Homebrew:

If we’re on Linux, we can install FFmpeg with apt , dnf or pacman , depending on our Linux distribution. Here’s the command for installing with apt :

Why do we need to trim the audio? Say a user has an hour-long audio file and only wants to transcribe from the 15-minute mark to 45-minute mark. With FFmpeg, we can trim to the exact startTime and endTime , before sending the trimmed stream to Whisper for transcription.

First, we’ll import the the following libraries:

  • fluent-ffmpeg is a Node.js module that provides a fluent API for interacting with FFmpeg.
  • ffmetadata will be used to read the metadata of the audio file — specifically, the duration .
  • ffmpeg.setFfmpegPath(ffmpegPath) is used to explicitly set the path to the FFmpeg binary.
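A sketch of those imports:

```js
const ffmpeg = require("fluent-ffmpeg");
const ffmetadata = require("ffmetadata");
const ffmpegPath = require("ffmpeg-static");

// Point fluent-ffmpeg at the bundled static FFmpeg binary.
ffmpeg.setFfmpegPath(ffmpegPath);
```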

Next, let’s create a utility function to convert time passed as mm:ss into seconds. This can be outside of our app.post route, just like the bufferToStream function:

Next, we should update our app.post route to do the following:

  • accept the startTime and endTime
  • calculate the duration
  • deal with basic error handling
  • convert audio buffer to stream
  • trim audio with FFmpeg
  • send the trimmed audio to OpenAI for transcription

The trimAudio function trims an audio stream between a specified start time and end time, and returns a promise that resolves with the trimmed audio data. If an error occurs at any point in this process, the promise is rejected with that error.

Let’s break down the function step by step.

Define the trim audio function . The trimAudio function is asynchronous and accepts the audioStream and endTime as arguments. We define temporary filenames for processing the audio:

Write stream to a temporary file . We write the incoming audio stream into a temporary file using fs.createWriteStream() . If there’s an error, the Promise gets rejected:

Read metadata and set endTime . After the audio stream finishes writing to the temporary file, we read the metadata of the file using ffmetadata.read() . If the provided endTime is longer than the audio duration, we adjust endTime to be the duration of the audio:

Trim Audio using FFmpeg . We utilize FFmpeg to trim the audio based on the start time ( startSeconds ) received and duration ( timeDuration ) calculated earlier. The trimmed audio is written to the output file:

Delete temporary files and resolve promise . After trimming the audio, we delete the temporary file and read the trimmed audio into a buffer. We also delete the output file using the Node.js file system after reading it to the buffer. If everything goes well, the Promise gets resolved with the trimmedAudioBuffer . In case of an error, the Promise gets rejected:
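Putting the steps above together, a rough consolidated sketch of trimAudio might look like the following; the temporary filenames, the exact ffmetadata options, and the function signature are assumptions:

```js
const fs = require("fs");

const trimAudio = (audioStream, startSeconds, endTime) => {
  const tempFileName = `temp-${Date.now()}.mp3`;
  const outputFileName = `trimmed-${Date.now()}.mp3`;

  return new Promise((resolve, reject) => {
    // 1. Write the incoming stream to a temporary file.
    const writeStream = fs.createWriteStream(tempFileName);
    audioStream.pipe(writeStream);
    writeStream.on("error", reject);

    writeStream.on("finish", () => {
      // 2. Read the metadata to get the duration and clamp endTime if needed.
      ffmetadata.read(tempFileName, { duration: true }, (err, metadata) => {
        if (err) return reject(err);

        const duration = parseFloat(metadata.duration);
        if (endTime > duration) endTime = duration;
        const timeDuration = endTime - startSeconds;

        // 3. Trim the audio with FFmpeg and write it to the output file.
        ffmpeg(tempFileName)
          .setStartTime(startSeconds)
          .setDuration(timeDuration)
          .output(outputFileName)
          .on("error", reject)
          .on("end", () => {
            // 4. Read the trimmed audio into a buffer, then clean up temp files.
            const trimmedAudioBuffer = fs.readFileSync(outputFileName);
            fs.unlinkSync(tempFileName);
            fs.unlinkSync(outputFileName);
            resolve(trimmedAudioBuffer);
          })
          .run();
      });
    });
  });
};
```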

The full code for the endpoint is available in this GitHub repo .

The styling will be done with Tailwind, but I won’t cover setting up Tailwind. You can read about how to set up and use Tailwind here .

Since our API accepts startTime and endTime , let’s create a TimePicker component with react-select . Using react-select simply adds other features to the select menu like searching the options, but it’s not critical to this article and can be skipped.

Let’s break down the TimePicker React component below:

Imports and component declaration . First, we import necessary packages and declare our TimePicker component. The TimePicker component accepts the props id , label , value , onChange , and maxDuration :

Parse the value prop . The value prop is expected to be a time string (format HH:MM:SS ). Here we split the time into hours, minutes, and seconds:

Calculate maximum values . maxDuration is the maximum time in seconds that can be selected, based on audio duration. It’s converted into hours, minutes, and seconds:

Options for time selects . We create arrays for possible hours, minutes, and seconds options, and state hooks to manage the minute and second options:

Update value function . This function updates the current value by calling the onChange function passed in as a prop:

Update minute and second options function . This function updates the minute and second options depending on the selected hours and minutes:

Effect Hook . This calls updateMinuteAndSecondOptions when hours or minutes change:

Helper functions . These two helper functions convert time integers to select options and vice versa:
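Roughly (the names are assumptions):

```js
// Convert an integer to a react-select option and back.
const toOption = (value) => ({
  value,
  label: String(value).padStart(2, "0"),
});

const fromOption = (option) => option.value;
```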

Render . The render function displays the time picker, which consists of three dropdown menus (hours, minutes, seconds) managed by the react-select library. Changing the value in the select boxes will call updateValue and updateMinuteAndSecondOptions , which were explained above.

You can find the full source code of the TimePicker component on GitHub .

Now let’s build the main frontend component by replacing App.js .

The App component will implement a transcription page with the following functionalities:

  • Define helper functions for time format conversion.
  • Update startTime and endTime based on selection from the TimePicker component.
  • Define a getAudioDuration function that retrieves the duration of the audio file and updates the audioDuration state.
  • Handle file uploads for the audio file to be transcribed.
  • Define a transcribeAudio function that sends the audio file by making an HTTP POST request to our API.
  • Render UI for file upload.
  • Render TimePicker components for selecting startTime and endTime .
  • Display notification messages.
  • Display the transcribed text.

Let’s break this component down into several smaller sections:

Imports and helper functions . Import necessary modules and define helper functions for time conversions:
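A sketch of that section; the helper names and the toast library are assumptions:

```jsx
import React, { useState } from "react";
import axios from "axios";
import { useDropzone } from "react-dropzone";
import { toast } from "react-toastify"; // notification library assumed
import TimePicker from "./TimePicker";

// Convert a number of seconds into an "HH:MM:SS" string.
const secondsToHHMMSS = (total) => {
  const h = Math.floor(total / 3600);
  const m = Math.floor((total % 3600) / 60);
  const s = Math.floor(total % 60);
  return [h, m, s].map((n) => String(n).padStart(2, "0")).join(":");
};

// Convert a number of seconds into an "MM:SS" string for the API.
const secondsToMMSS = (total) => {
  const m = Math.floor(total / 60);
  const s = Math.floor(total % 60);
  return [m, s].map((n) => String(n).padStart(2, "0")).join(":");
};
```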

Component declaration and state hooks . Declare the TranscriptionPage component and initialize state hooks:

Event handlers . Define various event handlers — for handling start time change, getting audio duration, handling file drop, and transcribing audio:

Use the Dropzone hook . Use the useDropzone hook from the react-dropzone library to handle file drops:

Render . Finally, render the component. This includes a dropzone for file upload, TimePicker components for setting start and end times, a button for starting the transcription process, and a display for the resulting transcription.

The transcribeAudio function is an asynchronous function responsible for sending the audio file to a server for transcription. Let’s break it down:
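A sketch of the function, assuming the state setters and helpers introduced above:

```jsx
const transcribeAudio = async () => {
  setUploading(true);
  try {
    const formData = new FormData();
    audioFile && formData.append("file", audioFile);
    formData.append("startTime", secondsToMMSS(startTime));
    formData.append("endTime", secondsToMMSS(endTime));

    // Point this at wherever the backend is running.
    const response = await axios.post(
      "http://localhost:3001/api/transcribe",
      formData,
      { headers: { "Content-Type": "multipart/form-data" } }
    );

    setTranscription(response.data.transcription);
    toast.success("Transcription successful!");
  } catch (error) {
    toast.error("Transcription failed. Please try again.");
  } finally {
    setUploading(false);
  }
};
```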

Here’s a more detailed look:

setUploading(true); . This line sets the uploading state to true , which we use to indicate to the user that the transcription process has started.

const formData = new FormData(); . FormData is a web API used to send form data to the server. It allows us to send key–value pairs where the value can be a Blob, File or a string.

The audioFile is appended to the formData object, provided it’s not null ( audioFile && formData.append('file', audioFile); ). The start and end times are also appended to the formData object, but they’re converted to MM:SS format first.

The axios.post method is used to send the formData to a server endpoint ( http://localhost:3001/api/transcribe ). Change http://localhost:3001 to the server address. This is done with an await keyword, meaning that the function will pause and wait for the Promise to be resolved or be rejected.

If the request is successful, the response object will contain the transcription result ( response.data.transcription ). This is then set to the transcription state using the setTranscription function. A successful toast notification is then shown.

If an error occurs during the process, an error toast notification is shown.

In the finally block, regardless of the outcome (success or error), the uploading state is set back to false to allow the user to try again.

In essence, the transcribeAudio function is responsible for coordinating the entire transcription process, including handling the form data, making the server request, and handling the server response.

You can find the full source code of the App component on GitHub .

We’ve reached the end and now have a full web application that transcribes speech to text with the power of Whisper.

We could definitely add a lot more functionality, but I’ll let you build the rest on your own. Hopefully we’ve gotten you off to a good start.

Here’s the full source code:

  • Backend repo on GitHub
  • Frontend repo on GitHub

What is Whisper and how does it work with React and Node?

Whisper is an automatic speech recognition (ASR) system developed by OpenAI. It’s designed to convert spoken language into written text. When used with React and Node, Whisper can provide real-time transcription services for applications. React, a JavaScript library for building user interfaces, can display the transcriptions on the front-end, while Node.js, a back-end JavaScript runtime, can handle the server-side operations such as sending audio data to Whisper and receiving transcriptions.

How can I install and set up Whisper in my React and Node project?

To install Whisper, you need to clone the Whisper ASR repository from GitHub. After cloning, you can install the necessary dependencies using npm or yarn. Setting up Whisper in your project involves configuring your server-side code (Node.js) to send audio data to Whisper and receive transcriptions. On the front-end (React), you need to set up components to display the transcriptions.

Can I use Whisper for languages other than English?

Currently, Whisper primarily supports English. However, OpenAI is continuously working on improving and expanding the capabilities of Whisper, so support for other languages may be added in the future.

How accurate is Whisper in transcribing speech to text?

Whisper is designed to be highly accurate in transcribing speech to text. However, like any ASR system, its accuracy can be influenced by factors such as the clarity of the speech, background noise, and the speaker’s accent.

How can I improve the accuracy of Whisper’s transcriptions?

You can improve the accuracy of Whisper’s transcriptions by ensuring clear and distinct speech, minimizing background noise, and using a high-quality microphone. Additionally, you can customize Whisper’s settings to better suit your specific use case.

Is Whisper suitable for real-time transcription in a production environment?

Yes, Whisper is designed to handle real-time transcription in a production environment. Its performance can be optimized by properly configuring your server-side code and ensuring a stable internet connection.

Can I use Whisper for offline transcription?

Currently, Whisper requires an internet connection to function as it needs to communicate with the OpenAI servers for transcription. Offline functionality is not available at this time.

How can I handle errors or issues when using Whisper?

When using Whisper, you can handle errors or issues by implementing error handling in your code. This can involve catching and logging errors, retrying operations, and providing user-friendly error messages.

Is there a cost associated with using Whisper?

As of now, Whisper is an open-source project and can be used free of charge. However, it’s always a good idea to check the official OpenAI website for any updates regarding pricing.

Can I contribute to the development of Whisper?

Yes, as an open-source project, contributions to the development of Whisper are welcome. You can contribute by submitting pull requests on GitHub, reporting issues, or suggesting improvements.


Using the Speech-to-Text API with Node.js

1. Overview

Google Cloud Speech-to-Text API enables developers to convert audio to text in 120 languages and variants, by applying powerful neural network models in an easy to use API.

In this codelab, you will focus on using the Speech-to-Text API with Node.js. You will learn how to send an audio file in English and other languages to the Cloud Speech-to-Text API for transcription.

What you'll learn

  • How to enable the Speech-to-Text API
  • How to Authenticate API requests
  • How to install the Google Cloud client library for Node.js
  • How to transcribe audio files in English
  • How to transcribe audio files with word timestamps
  • How to transcribe audio files in different languages

What you'll need

  • A Google Cloud Platform Project
  • A browser, such as Chrome or Firefox
  • Familiarity with JavaScript/Node.js

2. Setup and requirements

Self-paced environment setup

  • Sign in to Cloud Console and create a new project or reuse an existing one. (If you don't already have a Gmail or G Suite account, you must create one .)


Remember the project ID, a unique name across all Google Cloud projects (the name above has already been taken and will not work for you, sorry!). It will be referred to later in this codelab as PROJECT_ID .

  • Next, you'll need to enable billing in Cloud Console in order to use Google Cloud resources.

Running through this codelab shouldn't cost much, if anything at all. Be sure to follow any instructions in the "Cleaning up" section which advises you how to shut down resources so you don't incur billing beyond this tutorial. New users of Google Cloud are eligible for the $300USD Free Trial program.

Start Cloud Shell

While Google Cloud can be operated remotely from your laptop, in this codelab you will be using Google Cloud Shell , a command line environment running in the Cloud.

Activate Cloud Shell


If you've never started Cloud Shell before, you'll be presented with an intermediate screen (below the fold) describing what it is. If that's the case, click Continue (and you won't ever see it again). Here's what that one-time screen looks like:


It should only take a few moments to provision and connect to Cloud Shell.


This virtual machine is loaded with all the development tools you'll need. It offers a persistent 5GB home directory and runs in Google Cloud, greatly enhancing network performance and authentication. Much, if not all, of your work in this codelab can be done with simply a browser or your Chromebook.

Once connected to Cloud Shell, you should see that you are already authenticated and that the project is already set to your project ID.

  • Run the following command in Cloud Shell to confirm that you are authenticated:

The command output will list your active account. If the active project is not set to your PROJECT_ID, you can set it with this command:
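gcloud config set project <PROJECT_ID>

Replace <PROJECT_ID> with your own project ID.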

3. Enable the Speech-to-Text API

Before you can begin using the Speech-to-Text API, you must enable the API. You can enable the API by using the following command in the Cloud Shell:
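Speech-to-Text's service name is speech.googleapis.com, so:

gcloud services enable speech.googleapis.com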

4. Authenticate API requests

In order to make requests to the Speech-to-Text API, you need to use a Service Account . A Service Account belongs to your project and it is used by the Google Client Node.js library to make Speech-to-Text API requests. Like any other user account, a service account is represented by an email address. In this section, you will use the Cloud SDK to create a service account and then create credentials you will need to authenticate as the service account.

First, set an environment variable with your PROJECT_ID, which you will use throughout this codelab. If you are using Cloud Shell, this will already be set for you:
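One way to set it explicitly:

export PROJECT_ID=$(gcloud config get-value project)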

Next, create a new service account to access the Speech-to-Text API by using:
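A sketch of the command — the service account name my-speech-to-text-sa is only an example:

gcloud iam service-accounts create my-speech-to-text-sa \
  --display-name "my speech-to-text codelab service account"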

Next, create credentials that your Node.js code will use to log in as your new service account. Create these credentials and save them as a JSON file ~/key.json by using the following command:
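Assuming the example service account name from the previous step:

gcloud iam service-accounts keys create ~/key.json \
  --iam-account my-speech-to-text-sa@${PROJECT_ID}.iam.gserviceaccount.com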

Finally, set the GOOGLE_APPLICATION_CREDENTIALS environment variable, which is used by the Speech-to-Text API Node.js library, covered in the next step, to find your credentials. The environment variable should be set to the full path of the credentials JSON file you created, by using:
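For example:

export GOOGLE_APPLICATION_CREDENTIALS="$HOME/key.json"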

You can read more about authenticating the Speech-to-Text API .

5. Install the Google Cloud Speech-to-Text API client library for Node.js

First, create a project folder that you will use to run this Speech-to-Text API lab, then initialize a new Node.js package inside it:
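For example (the folder name is only an illustration):

mkdir ~/speech-to-text-nodejs
cd ~/speech-to-text-nodejs
npm init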

NPM asks several questions about the project configuration, such as name and version. For each question, press ENTER to accept the default values. The default entry point is a file named index.js .

Next, install the Google Cloud Speech library to the project:
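The client library is published on npm as @google-cloud/speech:

npm install --save @google-cloud/speech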

For more instructions on how to set up a Node.js development environment for Google Cloud, please see the Setup Guide .

Now, you're ready to use the Speech-to-Text API!

6. Transcribe Audio Files

In this section, you will transcribe a pre-recorded audio file in English. The audio file is available on Google Cloud Storage.

Navigate to the index.js file in your project folder and replace its contents with the following:
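A minimal sketch of such a transcription program; the Cloud Storage path below points to a public sample file and is only an example:

// index.js — a minimal transcription sketch
const speech = require('@google-cloud/speech');

async function main() {
  // Creates a client using the credentials in GOOGLE_APPLICATION_CREDENTIALS
  const client = new speech.SpeechClient();

  const audio = {
    // Public sample file on Cloud Storage (example path)
    uri: 'gs://cloud-samples-data/speech/brooklyn_bridge.flac',
  };
  const config = {
    encoding: 'FLAC',
    languageCode: 'en-US',
  };

  // Sends the recognition request and prints the transcript
  const [response] = await client.recognize({ audio, config });
  const transcription = response.results
    .map(result => result.alternatives[0].transcript)
    .join('\n');
  console.log(`Transcription: ${transcription}`);
}

main().catch(console.error);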

Take a minute or two to study the code and see how it is used to transcribe an audio file.

The encoding parameter tells the API which type of audio encoding you're using for the audio file. FLAC is the encoding type for .flac files (see the documentation on encoding types for more details).

In the RecognitionAudio object, you can pass the API either the uri of the audio file in Cloud Storage or the local file path for the audio file. Here, we're using a Cloud Storage uri.

Run the program:
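node index.js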

You should see the transcription of the audio file printed to the console.

7. Transcribe with word timestamps

Speech-to-Text can detect time offsets (timestamps) for the transcribed audio. Time offsets show the beginning and end of each spoken word in the supplied audio. A time offset value represents the amount of time that has elapsed from the beginning of the audio, in increments of 100ms.
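Only the request config needs to change. A self-contained sketch (the sample file path is again only an example):

const speech = require('@google-cloud/speech');
const client = new speech.SpeechClient();

async function transcribeWithTimestamps() {
  const audio = { uri: 'gs://cloud-samples-data/speech/brooklyn_bridge.flac' };  // example sample file
  const config = {
    encoding: 'FLAC',
    languageCode: 'en-US',
    enableWordTimeOffsets: true,   // ask the API for per-word time offsets
  };

  const [response] = await client.recognize({ audio, config });
  response.results.forEach(result => {
    const alternative = result.alternatives[0];
    console.log(`Transcription: ${alternative.transcript}`);
    alternative.words.forEach(wordInfo => {
      const start = `${wordInfo.startTime.seconds}.${wordInfo.startTime.nanos / 100000000}`;
      const end = `${wordInfo.endTime.seconds}.${wordInfo.endTime.nanos / 100000000}`;
      console.log(`  ${wordInfo.word}: ${start}s - ${end}s`);
    });
  });
}

transcribeWithTimestamps().catch(console.error);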

Take a minute or two to study the code and see how it is used to transcribe an audio file with word timestamps. The enableWordTimeOffsets parameter tells the API to enable time offsets (see the doc for more details).

Run your program again:

8. Transcribe different languages

Speech-to-Text API supports transcription in over 100 languages! You can find a list of supported languages here .

In this section, you will transcribe a pre-recorded audio file in French. The audio file is available on Google Cloud Storage.
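A sketch of the same program with a French language code; the sample file path is an assumption:

const speech = require('@google-cloud/speech');
const client = new speech.SpeechClient();

async function transcribeFrench() {
  const audio = { uri: 'gs://cloud-samples-data/speech/corbeau_renard.flac' };  // example French sample
  const config = {
    encoding: 'FLAC',
    languageCode: 'fr-FR',   // BCP-47 code for French (France)
  };

  const [response] = await client.recognize({ audio, config });
  const transcription = response.results
    .map(result => result.alternatives[0].transcript)
    .join('\n');
  console.log(`Transcription: ${transcription}`);
}

transcribeFrench().catch(console.error);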

Run your program again and you should see the transcription of the French audio printed to the console. It is a sentence from a popular French children's tale.

For the full list of supported languages and language codes, see the documentation here .

9. Congratulations!

You learned how to use the Speech-to-Text API using Node.js to perform different kinds of transcription on audio files!

To avoid incurring charges to your Google Cloud Platform account for the resources used in this quickstart:

  • Go to the Cloud Platform Console .
  • Select the project you want to shut down, then click 'Delete' at the top: this schedules the project for deletion.

Learn more

  • Google Cloud Speech-to-Text API: https://cloud.google.com/speech-to-text/docs
  • Node.js on Google Cloud Platform: https://cloud.google.com/nodejs/
  • Google Cloud Node.js client: https://googlecloudplatform.github.io/google-cloud-node/



Voice driven web apps - Introduction to the Web Speech API

The new JavaScript Web Speech API makes it easy to add speech recognition to your web pages. This API allows fine control and flexibility over the speech recognition capabilities in Chrome version 25 and later. Here's an example with the recognized text appearing almost immediately while speaking.

Web Speech API demo

DEMO / SOURCE

Let’s take a look under the hood. First, we check to see if the browser supports the Web Speech API by checking if the webkitSpeechRecognition object exists. If not, we suggest the user upgrades their browser. (Since the API is still experimental, it's currently vendor prefixed.) Lastly, we create the webkitSpeechRecognition object which provides the speech interface, and set some of its attributes and event handlers.
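A condensed sketch of that setup (the full demo wires up more handlers and UI state):

if (!('webkitSpeechRecognition' in window)) {
  // Web Speech API not available — suggest the user upgrades their browser
  console.log('Web Speech API is not supported in this browser.');
} else {
  var recognition = new webkitSpeechRecognition();
  recognition.continuous = true;       // keep listening even if the user pauses
  recognition.interimResults = true;   // deliver interim (non-final) results too

  recognition.onstart = function() { console.log('Recognition started'); };
  recognition.onerror = function(event) { console.log('Recognition error:', event.error); };
  recognition.onend = function() { console.log('Recognition ended'); };
}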

The default value for continuous is false, meaning that when the user stops talking, speech recognition will end. This mode is great for simple text like short input fields. In this demo , we set it to true, so that recognition will continue even if the user pauses while speaking.

The default value for interimResults is false, meaning that the only results returned by the recognizer are final and will not change. The demo sets it to true so we get early, interim results that may change. Watch the demo carefully: the grey text is interim and may still change, whereas the black text consists of responses from the recognizer that are marked final and will not change.

To get started, the user clicks on the microphone button, which triggers this code:
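Roughly, the click handler looks like this; select_dialect is a stand-in id for the demo's language drop-down:

function startButton(event) {
  // Use the BCP-47 language code the user picked, e.g. "en-US"
  recognition.lang = document.getElementById('select_dialect').value;
  recognition.start();
}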

We set the spoken language for the speech recognizer "lang" to the BCP-47 value that the user has selected via the selection drop-down list, for example “en-US” for English-United States. If this is not set, it defaults to the lang of the HTML document root element and hierarchy. Chrome speech recognition supports numerous languages (see the “ langs ” table in the demo source), as well as some right-to-left languages that are not included in this demo, such as he-IL and ar-EG.

After setting the language, we call recognition.start() to activate the speech recognizer. Once it begins capturing audio, it calls the onstart event handler, and then for each new set of results, it calls the onresult event handler.

This handler concatenates all the results received so far into two strings: final_transcript and interim_transcript . The resulting strings may include "\n", such as when the user speaks “new paragraph”, so we use the linebreak function to convert these to HTML tags <br> or <p> . Finally it sets these strings as the innerHTML of their corresponding <span> elements: final_span which is styled with black text, and interim_span which is styled with gray text.

interim_transcript is a local variable, and is completely rebuilt each time this event is called because it’s possible that all interim results have changed since the last onresult event. We could do the same for final_transcript simply by starting the for loop at 0. However, because final text never changes, we’ve made the code here a bit more efficient by making final_transcript a global, so that this event can start the for loop at event.resultIndex and only append any new final text.
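A sketch of the handler described above; final_span, interim_span, and linebreak come from the demo page and are assumed here:

var final_transcript = '';   // global: final text only ever grows

recognition.onresult = function(event) {
  var interim_transcript = '';
  // Start at resultIndex so already-final results are not re-appended
  for (var i = event.resultIndex; i < event.results.length; ++i) {
    if (event.results[i].isFinal) {
      final_transcript += event.results[i][0].transcript;
    } else {
      interim_transcript += event.results[i][0].transcript;
    }
  }
  final_span.innerHTML = linebreak(final_transcript);       // black, final text
  interim_span.innerHTML = linebreak(interim_transcript);   // grey, may still change
};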

That’s it! The rest of the code is there just to make everything look pretty. It maintains state, shows the user some informative messages, and swaps the GIF image on the microphone button between the static microphone, the mic-slash image, and mic-animate with the pulsating red dot.

The mic-slash image is shown when recognition.start() is called, and then replaced with mic-animate when onstart fires. Typically this happens so quickly that the slash is not noticeable, but the first time speech recognition is used, Chrome needs to ask the user for permission to use the microphone, in which case onstart only fires when and if the user allows permission. Pages hosted on HTTPS do not need to ask repeatedly for permission, whereas HTTP hosted pages do.

So make your web pages come alive by enabling them to listen to your users!

We’d love to hear your feedback...

  • For comments on the W3C Web Speech API specification: email , mailing archive , community group
  • For comments on Chrome’s implementation of this spec: email , mailing archive

Refer to the Chrome Privacy Whitepaper to learn how Google is handling voice data from this API.


Last updated 2013-01-13 UTC.


How to convert speech to text using JavaScript?

To convert spoken words to text, we generally use the Web Speech API's SpeechRecognition component. The SpeechRecognition component recognizes the spoken words from audio and converts them to text. The recognized words are stored in an array of results, which are then displayed inside an HTML element on the browser screen.

The basic syntax used is −
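// Create a recognition object (webkit-prefixed in Chrome and Safari)
var recognition = new webkitSpeechRecognition();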

We can also use SpeechRecognition() instead of webkitSpeechRecognition() , as webkitSpeechRecognition() is the prefixed version used in Chrome and Apple Safari for speech recognition.

Step 1 − Create an HTML page as given below: create an HTML button using the <button> tag and add an onclick event to it with the function name "runSpeechRecog()". Also create a <p> tag with the id "action".

Step 2 − Create a runSpeechRecog() arrow function inside a script tag, as we are using internal JavaScript.

Step 3 − Select the <p> tag using the Document Object Model (DOM) method document.getElementById() and store it in a variable.

Step 4 − Create an object of the webkitSpeechRecognition() constructor and store it in a reference variable, so that all the methods of the webkitSpeechRecognition() class are available through that variable.

Step 5 − Use recognition.onstart — this event handler runs when recognition starts, so we can show that the browser is listening.

Step 6 − Now use recognition.onresult to display the spoken words on the screen.

Step 7 − Use the recognition.start() method to start the speech recognition. A combined sketch of all the steps is shown below.
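Putting Steps 1-7 together, a minimal sketch might look like this (the exact messages are only examples):

<!DOCTYPE html>
<html>
<body>
   <button onclick="runSpeechRecog()">Start Listening</button>
   <p id="action"></p>

   <script>
      const runSpeechRecog = () => {
         // Step 3: grab the output element
         const action = document.getElementById('action');

         // Step 4: create the recognition object
         const recognition = new webkitSpeechRecognition();

         // Step 5: show that the browser has started listening
         recognition.onstart = () => {
            action.innerHTML = 'Listening... please speak into the microphone';
         };

         // Step 6: display the spoken words on the screen
         recognition.onresult = (event) => {
            const transcript = event.results[0][0].transcript;
            action.innerHTML = 'Transcript: ' + transcript;
         };

         // Step 7: start the speech recognition
         recognition.start();
      };
   </script>
</body>
</html>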

Description

When the "runSpeechRecog()" function is triggered, webkitSpeechRecognition() is initialized and its properties are stored in the reference variable. The page then shows a message indicating that the browser is ready to listen to the user's spoken words.

When the user has stopped speaking, the result is stored as an array of words, which is then returned as a transcript of the sentence on the browser screen. For example, suppose a user runs this speech-to-text program in the browser, presses the speech button, and says "tutorialpoint.com"; when the user stops speaking, recognition stops and the transcript "tutorialpoint.com" is displayed on the screen.

The Web Speech API of JavaScript is used in many types of applications. It has two components: the SpeechRecognition API, used for speech-to-text conversion, and the SpeechSynthesis API, used for text-to-speech conversion. SpeechRecognition, as used above, is supported in Chrome, Apple Safari, and Opera.

Aman Gupta



All Speech-to-Text code samples

This page contains code samples for Speech-to-Text. To search and filter code samples for other Google Cloud products, see the Google Cloud sample browser .


JavaScript Text-to-Speech - The Easy Way

Learn how to build a simple JavaScript Text-to-Speech application using JavaScript's Web Speech API in this step-by-step beginner's guide.


When building an app, you may want to implement a Text-to-Speech feature for accessibility, convenience, or some other reason. In this tutorial, we will learn how to build a very simple JavaScript Text-to-Speech application using JavaScript's built-in Web Speech API .

For your convenience, we have provided the code for this tutorial application ready for you to fork and play around with over at Replit , or ready for you to clone from Github . You can also view a live version of the app here .

Step 1 - Setting Up The App

First, we set up a very basic application using a simple HTML file called index.html and a JavaScript file called script.js .

We'll also use a CSS file called style.css to add some margins and to center things, but it’s entirely up to you if you want to include this styling file.

The HTML file index.html defines our application's structure which we will add functionality to with the JavaScript file. We add an <h1> element which acts as a title for the application, an <input> field in which we will enter the text we want spoken, and a <button> which we will use to submit this input text. We finally wrap all of these objects inside of a <form> . Remember, the input and the button have no functionality yet - we'll add that in later using JavaScript.

Inside of the <head> element, which contains metadata for our HTML file, we import style.css . This tells our application to style itself according to the contents of style.css . At the bottom of the <body> element, we import our script.js file. This tells our application the name of the JavaScript file that stores the functionality for the application.
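Based on that description, index.html might look roughly like this; the ids form and text-input are the ones our JavaScript will select later:

<!DOCTYPE html>
<html>
  <head>
    <title>JavaScript Text-to-Speech</title>
    <link rel="stylesheet" href="style.css" />
  </head>
  <body>
    <h1>JavaScript Text-to-Speech</h1>
    <form id="form">
      <input id="text-input" type="text" placeholder="Enter text to speak" />
      <button type="submit">Submit</button>
    </form>
    <script src="script.js"></script>
  </body>
</html>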

Now that we have finished the index.html file, we can move on to creating the script.js JavaScript file.

Since we imported the script.js file to our index.html file above, we can test its functionality by simply sending an alert .

To add an alert to our code, we add the line of code below to our script.js file. Make sure to save the file and refresh your browser; you should now see a little window popping up with the text "It works!".
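alert("It works!");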

If everything went ok, you should be left with something like this:

JavaScript Text to Speech application


Step 2 - Checking Browser Compatibility

To create our JavaScript Text-to-Speech application, we are going to utilize JavaScript's built-in Web Speech API. Since this API isn’t compatible with all browsers, we'll need to check for compatibility. We can perform this check in one of two ways.

The first way is by looking up whether our browser and its version support the Web Speech API on caniuse.com .

The second way is by performing the check right inside of our code, which we can do with a simple conditional statement:
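A one-line check along these lines (the exact log messages are just examples):

"speechSynthesis" in window
  ? console.log("Web Speech API is supported")
  : console.log("Web Speech API is not supported");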

This is a shorthand if/else statement, and is equivalent to the following:
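// The same check written out as a full if/else
if ("speechSynthesis" in window) {
  console.log("Web Speech API is supported");
} else {
  console.log("Web Speech API is not supported");
}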

If you now run the app and check your browser console, you should see one of those messages. You can also choose to pass this information on to the user by rendering an HTML element.

Step 3 - Testing JavaScript Text-to-Speech

Next up, let’s write some static code to test if we can make the browser speak to us.

Add the following code to the script.js file.
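The snippet below matches the breakdown that follows:

const synth = window.speechSynthesis;
let ourText = "Hey there what's up!!!!";
const utterThis = new SpeechSynthesisUtterance(ourText);
synth.speak(utterThis);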

Code Breakdown

Let’s look at a code breakdown to understand what's going on:

  • With const synth = window.speechSynthesis we declare the synth variable to be an instance of the SpeechSynthesis object, which is the entry point to using JavaScript's Web Speech API. The speak method of this object is what ultimately converts text into speech.
  • let ourText = “Hey there what’s up!!!!” defines the ourText variable which holds the string of text that we want to be uttered.
  • const utterThis = new SpeechSynthesisUtterance(ourText) defines the utterThis variable to be a SpeechSynthesisUtterance object, into which we pass ourText .
  • Putting it all together, we call synth.speak(utterThis) , which utters the string inside ourText .

Save the code and refresh the browser window in which your app runs in order to hear a voice saying “ Hey there what’s up!!!! ”.


Step 4 - Making Our App Dynamic

Our code currently provides us with a good understanding of how the Text-to-Speech aspect of our application works under the hood, but the app at this point only converts the static text which we defined with ourText into speech. We want to be able to dynamically change what text is being converted to speech when using the application. Let’s do that now utilizing a <form> .

  • First, we add the const textInputField = document.querySelector("#text-input") variable, which allows us to access the value of the <input> tag that we have defined in the index.html file in our JavaScript code. We select the <input> field by its id: #text-input .
  • Secondly, we add the const form = document.querySelector("#form") variable, which selects our form by its id #form so we can later submit the <form> using the onsubmit function.
  • We initialize ourText as an empty string instead of a static sentence.
  • We wrap our browser compatibility logic in a function called checkBrowserCompatibility and then immediately call this function.

Finally, we create an onsubmit handler that executes when we submit our form (a combined sketch follows after this list). This handler does several things:

  • event.preventDefault() prevents the browser from reloading after submitting the form.
  • ourText = textInputField.value sets our ourText string to whatever we enter in the "input" field of our application.
  • utterThis.text = ourText sets the text to be uttered to the value of ourText .
  • synth.speak(utterThis) utters our text string.
  • textInputField.value resets the value of our input field to an empty string after submitting the form.
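Putting Step 4 together, script.js might look like this (the log messages are only placeholders):

const textInputField = document.querySelector("#text-input");
const form = document.querySelector("#form");
const synth = window.speechSynthesis;
let ourText = "";
const utterThis = new SpeechSynthesisUtterance(ourText);

const checkBrowserCompatibility = () => {
  "speechSynthesis" in window
    ? console.log("Web Speech API is supported")
    : console.log("Web Speech API is not supported");
};
checkBrowserCompatibility();

form.onsubmit = (event) => {
  event.preventDefault();            // stop the page from reloading
  ourText = textInputField.value;    // grab whatever the user typed
  utterThis.text = ourText;          // set the text to be uttered
  synth.speak(utterThis);            // speak it
  textInputField.value = "";         // clear the input field
};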

Step 5 - Testing Our JavaScript Text-to-Speech App

To test our JavaScript Text-to-Speech application, simply enter some text in the input field and hit “Submit” in order to hear the text converted to speech.

Additional Features

There are a lot of properties that can be modified when working with the Web Speech API. For instance:
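For example, a SpeechSynthesisUtterance exposes language, pitch, rate, and volume; the values below are just illustrations applied to the utterThis object from our app:

utterThis.lang = "en-US";   // language of the utterance
utterThis.pitch = 1.2;      // 0 to 2, default 1
utterThis.rate = 0.9;       // 0.1 to 10, default 1
utterThis.volume = 0.8;     // 0 to 1, default 1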

You can try playing around with these properties to tailor the application to your needs.

This simple example provides an outline of how to use the Web Speech API for JavaScript Text-to-Speech .

While Text-to-Speech is useful for accessibility, convenience, and other purposes, there are a lot of use cases in which the opposite functionality, i.e. Speech-to-Text, is useful. For those who want to learn more, we have built a couple of example projects using AssemblyAI's Speech-to-Text API that you can check out.


Some of them are:

  • React Speech Recognition with React Hooks
  • How To Convert Voice To Text Using JavaScript


Code Boxx

Javascript Text To Speech (Simple Examples)

DOWNLOAD & NOTES

Here is the download link to the example code, so you don’t have to copy-paste everything.

EXAMPLE CODE DOWNLOAD

JAVASCRIPT TEXT TO SPEECH

All right, let us now get into more examples of using text-to-speech in Javascript.

TUTORIAL VIDEO

1) SIMPLE TEXT TO SPEECH

This is the same as the introduction snippet, except that it does a feature check before enabling the test button – if ("speechSynthesis" in window) . At the time of writing, speechSynthesis is not “universally supported” in all browsers and operating systems. So, it’s good to add a few lines of code and do compatibility checks.
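A minimal sketch of that check combined with the basic speech call (the message text is just an example):

if ("speechSynthesis" in window) {
  var msg = new SpeechSynthesisUtterance("Text to speech test");
  speechSynthesis.speak(msg);
} else {
  alert("Sorry, your browser does not support speech synthesis.");
}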

1C) THE DEMO

2) CHOOSING A VOICE

3) MORE CONTROLS – VOLUME, PITCH, RATE

COMPATIBILITY CHECKS

LINKS & REFERENCES


Google Cloud Speech with Javascript

The documentation and tutorials cover the REST API (Google Speech API for Node: https://cloud.google.com/nodejs/apis ), so my question is: how can the Cloud Speech API be used in JavaScript? Has anyone used it on a web page with plain JavaScript?

2020-04-24 EDIT : The accepted answer is using webkitSpeechRecognition which is not available on mobile: https://stackoverflow.com/a/61039699/775359

Google documentation and examples: https://cloud.google.com/speech-to-text/docs/samples

Node.js code: https://github.com/googleapis/nodejs-speech/blob/master/samples/MicrophoneStream.js

REQUIREMENT: it has to run in Safari on iOS.

  • speech-recognition
  • google-cloud-platform
  • google-speech-api

— asked by Mars Robertson

  • I think you have the wrong idea about what the google cloud API's are all about - perhaps this is closer to what you need –  Jaromanda X Commented Aug 5, 2016 at 12:00
  • Did you acomplish this, with just javaScript? –  Ivalberto Commented Aug 25, 2021 at 11:50

The Google Cloud API is more specifically used for server-side speech processing. If you're looking to use voice recognition through a browser, you should use your browser's built in Web Speech API . Here's a simple example:

var recognition = new webkitSpeechRecognition();
recognition.continuous = true;
var output = document.getElementById('output');
recognition.onresult = function(event) {
  output.textContent = event.results[0][0].transcript;
};

<div id="output"></div>
<button onclick="recognition.start()">Start</button>
<button onclick="recognition.stop()">Stop</button>

— answered by fny

  • 5 They are totally different API. Web Speech API you are suggesting is private API and has limitation of 50 requests a day! We cant use if for commercial projects with this limitations. Please go through chromium.org/developers/how-tos/api-keys before you make the decision of using Speech Web API –  Paresh Varde Commented Oct 7, 2016 at 7:02
  • 1 This simply isn't true. The Web Speech API is a W3C supported, a cross-browser feature that even exists in Firefox today. It's up to the browser vendor decide how the speech is parsed, and the Google API keys come built into Google's Chrome builds by default. For a custom Chromium build, you would need to obtain API keys. –  fny Commented Oct 7, 2016 at 20:40
  • @PareshVarde Again, that's for compiling your own Chromium build. I can easily make thousands of requests in a day from my own browser without any issues. Here's an example that makes 1 request every 3 seconds for 100 requests: jsbin.com/faxeyi –  fny Commented Oct 11, 2016 at 19:24
  • 1 I am not using Google Chrome. I am packaging my application into electron atom and it causes issue. Google Chrome works fine without any affecting any quota limit they specified. –  Paresh Varde Commented Oct 12, 2016 at 12:59
  • Quick test using webkitSpeechRecognition in Chrome - rather poor quality. I'm keen to try Cloud version, opened a bounty of this question... –  Mars Robertson Commented Apr 27, 2020 at 16:28

