Google Cloud Speech-to-Text – what is it, and how to use it?

Google Cloud Speech-to-Text service – what is it?
Speech-to-text conversion models in Google Cloud
Speech-to-Text use cases
How much does Speech-to-Text cost?
How to use Speech-to-Text? Tutorial

Speech-to-text transcription is a breakthrough technology that enhances everyday human-machine interaction. It allows computers to recognise speech and respond to spoken commands. This translates into the automation of many activities and the creation of new productivity tools or customer service support systems.

Google Cloud Speech-to-Text service – what is it?

Speech-to-Text is one of the Google Cloud services. It is used for automated speech-to-text conversion and transcription. It uses advanced machine learning models from Google and supports transcription in more than 125 languages and dialects. The service is provided via an API (application programming interface), which allows you to connect it to your own application. As a result, an already functioning and proven service can be built into any product at a relatively low cost.

Speech-to-Text can process speech in two ways:

in real time, as the user speaks to an application with the service activated,
or by transcribing speech from an uploaded audio or video file.

The service copes with the transcription of even highly specialised terms and phrases. Thanks to the use of classes, it also converts spoken numbers, addresses or dates to their written notation (e.g. “fifty-three” is written as 53).

For applications running in containers on Kubernetes clusters, the Speech-to-Text service can be used in an on-premises model. The service is deployed alongside the application as a container, after which it can be used in a local environment. This solution is handy for organisations that must meet regulations restricting the use of cloud computing.

Speech-to-text conversion models in Google Cloud

The service offers different transcription models to suit the type of recordings or audio sources. Currently, there are four models:

default automatic speech recognition – this model can be used to transcribe longer recordings containing a single speaker’s voice. The model works best for recordings with a sampling rate of 16,000 Hz or higher,
automatic speech recognition for command and search – a model dedicated to the transcription of short recordings, for example, voice commands sent to applications,
video transcription – a model for converting speech to text from video footage in which multiple speakers are recorded. It supports recordings or streams sampled at 16,000 Hz or higher. This is a premium model, and it costs more than the two models above,
phone call transcription – a model designed to transcribe calls made over the phone. It works best with recordings sampled at 8,000 Hz. It is a premium model as well.

With these models, the speech processing service can be matched to the purpose of the application. One model suits a platform for streaming speeches, another a customer service support platform, and yet another an application that uses voice commands.
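As a rough illustration of matching a model to an application, the Python sketch below builds the config part of a recognition request for a given use case. The use-case names and the build_config helper are hypothetical. The model identifiers (default, command_and_search, video, phone_call) and the model and sampleRateHertz fields are assumed to match the v1 REST API and should be checked against the current documentation.

```python
# Hypothetical mapping from an application type to a transcription model.
# The identifiers on the right are assumed to match the `model` field
# of the v1 API; verify them against the current documentation.
MODEL_FOR_USE_CASE = {
    "long_recording": "default",
    "voice_command": "command_and_search",
    "video": "video",
    "phone_call": "phone_call",
}


def build_config(use_case, sample_rate_hz, language_code="en-US"):
    """Return a recognition config dict for the given (hypothetical) use case.

    sample_rate_hz must be the real sampling rate of the audio being sent.
    """
    return {
        "languageCode": language_code,
        "sampleRateHertz": sample_rate_hz,
        "model": MODEL_FOR_USE_CASE[use_case],
    }


# Example: a call-centre recording sampled at 8,000 Hz.
print(build_config("phone_call", 8000))
```

Keeping the mapping in one dictionary means the choice of model stays a configuration detail rather than being scattered through the application code.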

Speech-to-Text use cases

Speech-to-Text opens up many possibilities, and its transcription models allow the service to be used in a variety of applications. Automatic conversion from speech to text can be used, among other things, in customer service automation and support, in real-time video transcription and in voice command applications. Here are some scenarios where Speech-to-Text plays a significant role.

Customer service support

Speech-to-Text is one of the core services of Contact Center AI – a suite from Google Cloud for creating customer service solutions using artificial intelligence.

With the help of Speech-to-Text (and other services in the Contact Center AI portfolio), it is possible to create, among other things, a support system for consultants working in a call centre. By transcribing the conversation in real time, analysing the dialogue and detecting the customer’s intentions, the system provides the agent with the necessary materials and guidance on how to continue the conversation. The service can also be used to build an IVR (interactive voice response) system – an automated call centre operated by the customer’s voice, which helps to solve simple problems and, for more complex issues, redirects the caller to a consultant.

Voice control

Speech-to-Text allows you to implement voice commands and control the application using speech. It even has a dedicated transcription model – ASR: Command and search. By using the service, the app can respond to questions or voice commands, for example, “play the next movie in the queue”, “turn up the volume,” or “check the weather for Saturday”.

Media transcription

Speech-to-Text allows you to subtitle your videos in real time. You can also use the service to transcribe recordings and index the text to increase the reach of the material. Subtitles alongside the video also improve the audience experience – the vast majority of social media users watch videos without the sound on.

Translation

Speech-to-Text also supports translation, whether simultaneous or as subtitles added to a video. Translation works on text rather than directly on audio: the speech is first transcribed, and the resulting text is then translated. As a result, we can display English subtitles next to a film in a foreign language or use the simultaneous translator in Google Assistant.

Castbox – a podcasting platform that leverages Speech-to-Text

Castbox is a Hong Kong-based company and the largest podcast platform in that region, with around 2 million daily users. It provides nearly 100 million recordings, including podcast episodes and audiobooks, in more than 70 languages. Uniquely, Castbox keeps a transcription of all available recordings, indexes the content and lets users search for excerpts from specific episodes by phrase or keyword.

Castbox, thanks to Google Cloud services, can transcribe around 20 hours of recordings per day, with a 96% success rate in converting speech to text.

How much does Speech-to-Text cost?

The first 60 minutes of audio each month are free of charge. Beyond that first hour, usage is billed in 15-second increments.

For automatic speech recognition models (ASR: Default and ASR: Command and search), it is $0.006 per 15 seconds.

For premium models (Video and Phone Call), the price is $0.009 per 15 seconds.
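The pricing rules above can be turned into a quick estimate. The Python sketch below is only an approximation under stated assumptions: it pools all audio for the month into one total, subtracts the free 60 minutes, and rounds the remainder up to whole 15-second increments. The estimate_monthly_cost helper and the tier names are hypothetical, and actual invoices may round differently (for example per request), so treat the result as a ballpark figure.

```python
import math

FREE_SECONDS_PER_MONTH = 60 * 60   # first 60 minutes are free
INCREMENT_SECONDS = 15             # billing granularity quoted in the article

# Prices per 15-second increment, taken from the article.
PRICE_PER_INCREMENT_USD = {
    "standard": 0.006,  # ASR: Default and ASR: Command and search
    "premium": 0.009,   # Video and Phone Call
}


def estimate_monthly_cost(total_seconds, tier="standard"):
    """Estimate the monthly cost in USD for total_seconds of audio.

    Assumption: all audio is pooled and rounded up to 15-second increments.
    """
    billable_seconds = max(0, total_seconds - FREE_SECONDS_PER_MONTH)
    increments = math.ceil(billable_seconds / INCREMENT_SECONDS)
    return round(increments * PRICE_PER_INCREMENT_USD[tier], 3)


# Example: two hours of audio in a month with a standard model.
print(estimate_monthly_cost(2 * 60 * 60, tier="standard"))
```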

How to use Speech-to-Text? Tutorial

To implement Speech-to-Text, you need an account on Google Cloud.

Go to the console. Create a new project and remember its ID.

From the sidebar (menu on the left), select APIs & Services / Dashboard.

Click ENABLE APIS AND SERVICES.

Search for Cloud Speech API.

Click Enable, and wait a few seconds.

Run Cloud Shell by clicking the icon in the top right corner.

Wait for the session to start and the user@project:~$ prompt to appear.

Then generate an API key for sending requests. To create the key, go to APIs & Services / Credentials.

Select Create credentials and choose API key from the drop-down menu.

Copy the key that was just generated. In Cloud Shell, run the export command below, replacing YOUR_API_KEY with your generated key.

export API_KEY=YOUR_API_KEY

You can build the request to the service API in a request.json file. To create this file, you can use Cloud Shell’s built-in code editor:

Create a file named request.json in your directory and add the following lines:

{
  "config": {
    "encoding": "FLAC",
    "languageCode": "en-US"
  },
  "audio": {
    "uri": "gs://cloud-samples-tests/speech/brooklyn.flac"
  }
}

In the console, type the following command (on one line):

curl -s -X POST -H "Content-Type: application/json" --data-binary @request.json "https://speech.googleapis.com/v1/speech:recognize?key=${API_KEY}"
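The same call can also be made from Python with only the standard library. The following is a minimal sketch rather than an official client: build_recognize_request and recognize are hypothetical helper names, and the request body mirrors the request.json file created earlier. Building the request is kept separate from sending it, so the request can be inspected without network access.

```python
import json
import urllib.parse
import urllib.request

ENDPOINT = "https://speech.googleapis.com/v1/speech:recognize"


def build_recognize_request(api_key, gcs_uri, language_code="en-US"):
    """Build (but do not send) the POST request that the curl command makes."""
    body = {
        "config": {"encoding": "FLAC", "languageCode": language_code},
        "audio": {"uri": gcs_uri},
    }
    query = urllib.parse.urlencode({"key": api_key})
    return urllib.request.Request(
        url=f"{ENDPOINT}?{query}",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def recognize(request):
    """Send the request and return the parsed JSON response (needs network)."""
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

With the key exported as API_KEY, a call could look like recognize(build_recognize_request(os.environ["API_KEY"], "gs://cloud-samples-tests/speech/brooklyn.flac")), after importing os.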

The response should be as follows:

{
  "results": [
    {
      "alternatives": [
        {
          "transcript": "how old is the Brooklyn Bridge",
          "confidence": 0.98267895
        }
      ]
    }
  ]
}

The transcript value contains the transcription of the brooklyn.flac audio file produced by the service. The confidence value, between 0.0 and 1.0, is the API’s estimate of how likely it is that the speech was recognised and converted to text correctly.
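In an application, these two values usually need to be pulled out of the JSON response. The Python sketch below shows one way to do that, using the sample response from this tutorial. The best_transcripts helper and its min_confidence threshold are hypothetical, and the code assumes that the first alternative in each result is the most likely one.

```python
import json

# Sample response from the tutorial, embedded so the example runs offline.
SAMPLE_RESPONSE = """
{
  "results": [
    {
      "alternatives": [
        {
          "transcript": "how old is the Brooklyn Bridge",
          "confidence": 0.98267895
        }
      ]
    }
  ]
}
"""


def best_transcripts(response, min_confidence=0.0):
    """Return (transcript, confidence) for the first alternative of each result.

    Results whose confidence is below min_confidence are skipped.
    """
    pairs = []
    for result in response.get("results", []):
        alternatives = result.get("alternatives", [])
        if not alternatives:
            continue
        top = alternatives[0]
        confidence = top.get("confidence", 0.0)
        if confidence >= min_confidence:
            pairs.append((top.get("transcript", ""), confidence))
    return pairs


for transcript, confidence in best_transcripts(json.loads(SAMPLE_RESPONSE)):
    print(f"{transcript} (confidence {confidence:.2f})")
```

Filtering on a threshold is a design choice: a voice command application might reject low-confidence results and ask the user to repeat, while a transcription tool might keep everything and flag uncertain passages for review.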

And that’s it! This is how the Speech-to-Text API works.