Wednesday, February 13, 2008

Using speech-to-text software for transcribing screen capture videos

I am writing this post as a reply to a question raised by Daniel Moth on his blog:

"Would it be possible to point an application at a video file of a screencast, automatically extract all the text, and create a timestamped index of words so it becomes very easy to browse the video?"

I don't think we are quite there yet.
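The indexing half of that idea is the easy part. Here is a minimal sketch in Python, assuming a recognizer that hands back each word together with the time at which it is spoken. That input is exactly the piece that is hard to get reliably; the `timed_words` list below is made up for illustration and does not come from any real tool.

```python
from collections import defaultdict


def build_word_index(timed_words):
    """Map each word to the sorted list of times (in seconds) at which it is spoken."""
    index = defaultdict(list)
    for word, start_seconds in timed_words:
        # Normalize case and trim surrounding punctuation so "Treeview," matches "treeview".
        key = word.lower().strip(".,!?\"'")
        if key:
            index[key].append(start_seconds)
    return {word: sorted(times) for word, times in index.items()}


# Hypothetical recognizer output: (word, start time in seconds).
timed_words = [
    ("Treeview", 0.4), ("control", 0.9), ("provides", 1.3),
    ("a", 1.7), ("hierarchical", 1.8), ("view", 2.5),
    ("treeview", 12.1), ("control", 12.6),
]

index = build_word_index(timed_words)
print(index["treeview"])  # [0.4, 12.1]
```

With an index like this, browsing the video is just a lookup followed by a seek to the chosen time.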

I tried an experiment with Dragon NaturallySpeaking (DNS) 9. First off, I used my own voice (I am from India and still speak with an accent) for the initial training.

Then I used a screencast video from LearnVisualStudio.NET to test the accuracy of the software.

The video itself can be found here (the C# version)

Video source

I typed out an exact transcript of the first two minutes of the video by hand:

"Treeview control provides a hierarchical view of data. In this first example that we will take a look at I have a treeview control on the left hand side and corresponding items on the right hand side represented in a listview control. This mimics a Windows explorer like interface and I am simply mocking up a document management application and we will use this example to kind of demonstrate three different concepts. First of all, how to add tree nodes and create this parent child hierarchy between tree nodes. Then we will take a look at how we will associate values such as listview items with corresponding tree nodes on the left hand side. And then finally we will take a look at how to change the icon state between the open state whenever the tree is expanded and the closed state whenever the tree is collapsed.To prepare this example, i first started with a split container control and then placed a treeview control within the first panel and a listview control within the second panel. I also added a imagelist control and configured the images in the image collection editor. You see that I have four images, the first two images indexes zero and one are used for my treeview and indexes two and three are used for my listview control. I also associated my imagelist with my treeview control in code. I could have even easily done that by working with the imagelist property within the properties window. Unlike most of the other controls we featured in the windows forms controls series, the treeview control really lends itself to be configured within the code view. so thats what we will do"


Dragon NaturallySpeaking can transcribe the audio in an MP3 file, so I extracted the audio from the video file and saved it as an MP3. Here is the transcript it produced. Remember that at this point the software had only been trained with my voice, not the speaker's.
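One way to do the audio extraction step is with ffmpeg. This is only a sketch of that approach, assuming ffmpeg is installed and on the PATH and that the build can encode MP3. The file names are hypothetical.

```python
import subprocess


def ffmpeg_extract_command(video_path, mp3_path):
    """Build an ffmpeg command that drops the video stream and writes only the audio."""
    # -i names the input file; -vn disables video in the output.
    # ffmpeg picks the MP3 encoder from the .mp3 extension of the output file.
    return ["ffmpeg", "-i", video_path, "-vn", mp3_path]


def extract_audio(video_path, mp3_path):
    """Run ffmpeg and raise CalledProcessError if it exits with an error."""
    subprocess.run(ffmpeg_extract_command(video_path, mp3_path), check=True)


# Usage (hypothetical file names):
# extract_audio("treeview_screencast.wmv", "treeview_screencast.mp3")
```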

"The is the had the row against in a sparring items are the right against in generate a STB draw UNIX and Windows store at your face of civil marking up attacked in at an application MoUs six-year-old demonstrate three different hearts at Sprowston high at three nerves a creative spirit sharp hierarchy which we treat others devoted to work at Harrow associated leave such as whisky widens with the course neutrinos of a 10 side ended five optical look at how unchanged I can't stay between the open state were average resuspended" stage were average free is TO prepare this in Singapore I have restored with this particular to draw an improvised Atreides control within the first 10 a.m. was seeking straw within the second can I also carried a Guinness was to draw a figurative the images in the image collection and where is he that hath more images of refugees indexes row in one use from much Revue in India existed to release from was due to draw I have so associated by its Westcliff might frequent row in third that it is easily done had are working with be imaged West properly with other parties when they are of atmosphere the controls we featured in the warders watch control series the review control romances are to be configured with the electoral deals were sold to"

:-)
Then I trained the software on the voice of the speaker in the video. You can email me if you would like to know how I did that. This is the result of transcribing the same MP3 file using the user profile created from the speaker's voice:

"Him are a data in this first example take a look at the Treeview control the left-hand side and corresponding items on the right-hand side represented in a list control this mimics a Windows Explorer like interface and I'm simply mocking public document management application and will use this example to cut demonstrate three different concepts first of all how to add tree nodes and create this parent-child hierarchy between tree nodes that will take a look at how to associate values such as list view items with the corresponding tree nodes on the left-hand side and then finally we'll take a look at how to change the icon state between the open state whatever the truth expanded and the clothes date whatever the tree is collapsed to prepare this example I first started with a split container control and then placed a Treeview control within the first panel and a list view control within the second panel I also added a imageless control and configured the images in the image collection editor you see that I have for images the first two images indexes zero and one are used for much review and indexes two and three are used from a list view control I also associated my interest list with my Treeview control in code I could have easily done at by working with the image list property within the properties window I like most the other controls we featured in the Windows forms control series the Treeview control or lends itself to be configured within the code abuse vessel will do"

The accuracy is much better, and even some domain-specific words are recognized quite nicely! So it is definitely a fantastic piece of software, and you can see why quite a few people are giving it rave reviews on Amazon.com. Interestingly, some of them have even dictated their reviews.
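To put a number on "much better", the usual measure is word error rate (WER): the word-level edit distance between the hand-typed transcript and the recognizer output, divided by the number of words in the hand-typed transcript. I have not run this on the full transcripts above, so treat the following as a sketch. The tokenizer is my own simplification: it lowercases everything and splits on anything that is not a letter, digit or apostrophe, so "left-hand" and "left hand" count as the same two words.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into words, ignoring punctuation and hyphens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the number of reference words."""
    ref = tokenize(reference)
    hyp = tokenize(hypothesis)
    if not ref:
        raise ValueError("reference transcript contains no words")
    # prev[j] is the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from the hypothesis
            insertion = curr[j - 1] + 1  # extra word in the hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# A short phrase from the transcripts above: two of five words are wrong.
print(word_error_rate("are used for my treeview", "are used for much review"))  # 0.4
```

Passing the hand-typed transcript as `reference` and each DNS output as `hypothesis` would give one comparable figure per training setup.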

Which brings us back to the original question - will it be possible to obtain such high recognition accuracy that one can actually automatically extract the text from audio sources with zero training? I doubt it. The problem becomes even harder when we have multiple presenters involved in a single screencast.

So until that time, manual solutions are still going to be quite useful for this problem.
