Microsoft Cognitive Services
by Mary Branscombe
Microsoft has bundled its AI services into a set of APIs known as Microsoft Cognitive Services.
HardCopy Issue: 70 | Published: November 4, 2016
Machine learning is the hot new technique for accomplishing everything from recognising speech to offering recommendations on web stores to checking if that email from your boss is really a wire fraud phishing scam. Building machine learning systems is still complicated, but you can take advantage of machine learning algorithms with a minimum of code by calling the many APIs offered by Microsoft Cognitive Services. These cover vision, speech, language, knowledge and search, and provide functionality that can help with object recognition, emotion detection, facial identification, speech understanding, sentiment analysis and text analytics.
Want to tag the contents of photos so you can search for them later? Or extract the text from a credit card receipt so it can drop straight into an expense claim? Want to find out if the people stopping at your stand at a trade show are surprised or bored by your products? Want your support chatbot to be able to deal with a wide range of language rather than just a few keywords? How about something ambitious like describing the world or reading a menu to a blind person? Cognitive Services has an API for all of these and more, letting your app work with natural and spoken language.
As principal program manager Ryan Galgon explains, “Microsoft Cognitive Services is a collection of cross-platform, online APIs for developers to be able to access Microsoft artificial intelligence capabilities in, we hope, ways that are very easy to use and very easy to consume. We want to make these capabilities available online regardless of what platform someone is on or what language they’re using, so we make sample code available in as many languages as we can, including Python, C#, Objective C, Swift and so forth. When we say ‘artificial intelligence capabilities’, we’re talking about things in the field of computer vision, speech, natural language, knowledge representation and search. It’s a collection we keep adding to over time; not only new capabilities, but we improve and ship updates to capabilities we already have in the services and we try and do it in such a way that developers don’t need to update code in order to get new and improved results.”
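In practice, the services are plain HTTPS endpoints authenticated with a subscription key header, so "easy to consume" largely means building an ordinary REST request. A minimal Python sketch of that shape, using a placeholder endpoint and key (the real URL, API version and response fields depend on the service and region you sign up for):

```python
import json
import urllib.request

# Hypothetical endpoint and key -- real values depend on the service,
# API version and region of your subscription.
ENDPOINT = "https://api.example.com/vision/v1.0/describe"
SUBSCRIPTION_KEY = "your-key-here"

def build_describe_request(image_url):
    """Build (but don't send) a POST asking the vision service to
    describe an image; Cognitive Services calls carry the
    subscription key in a request header."""
    body = json.dumps({"url": image_url}).encode("utf-8")
    return urllib.request.Request(
        ENDPOINT,
        data=body,
        headers={
            "Content-Type": "application/json",
            "Ocp-Apim-Subscription-Key": SUBSCRIPTION_KEY,
        },
        method="POST",
    )

req = build_describe_request("https://example.com/photo.jpg")
# Sending this with urllib.request.urlopen(req) would return JSON
# along the lines of:
# {"description": {"captions": [{"text": "a man playing in a field",
#                                "confidence": 0.87}]}}
```

The same request shape, with a different path and body, covers most of the other services, which is what makes the sample code portable across languages.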
From Bing to code
What’s now Cognitive Services started in 2015 as Project Oxford, which provided four APIs for speech recognition, facial recognition, object recognition and language understanding, drawing on the work that Bing and other Microsoft teams had been doing to build AI features into their products. It’s now a commercial service with some 22 APIs at the time of writing, and the original services have been improved and expanded as well.
That includes the Bing search APIs. “You can pull in all of Bing’s knowledge of the web into any application; you can get access to Bing news results, Bing web results, images and videos, even the Bing search suggestions,” says Galgon. That’s not just about embedding search queries; you get to piggyback on the knowledge graph Bing uses to represent all the entities in the world, which you might know as Satori. That’s how Bing knows that movies have directors and actors and posters and release dates and soundtracks and scriptwriters and filming locations, while restaurants have menus and opening hours and special offers. For news stories, it means you can specify a topic; for images, you can get machine-generated captions, or a selection of images that are visually similar.
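A topic-scoped news query, for instance, is just a query-string parameter on a GET request. A sketch with an illustrative base URL (the real Bing Search host, path and API version come from your subscription, and the key travels in the same subscription-key header as the other services):

```python
from urllib.parse import urlencode

# Illustrative base URL -- the actual Bing Search API host and
# version are provided with your Cognitive Services subscription.
BING_NEWS = "https://api.example.com/bing/news/search"

def news_query_url(topic, count=10):
    """Build the URL for a topic-scoped news search: the topic goes
    in the `q` parameter, and `count` caps how many results come back."""
    return BING_NEWS + "?" + urlencode({"q": topic, "count": count})

url = news_query_url("flight delays")
print(url)
```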
Services are the new components
Image captioning and sentiment analysis aren’t the only features you can get as web services. Want to add text messages or phone calls to your app? Call one of Twilio’s REST APIs from your code. Need maps and routing in the software you’re writing? Bing Maps has an API that can give you address checking, location maps and driving directions. Need to generate an invoice or send a receipt? Plug in the SendGrid service. As long as your app is going to be connected, you can increasingly call services through RESTful APIs to provide key functionality, instead of writing it yourself or buying plug-in components or products.
For example, Uber’s app uses Google Maps for directions, Twilio for the text messages that passengers exchange with drivers, Braintree for payments, SendGrid for receipts and Box for storing content. Uber is also using Microsoft Cognitive Services for its new Real-Time ID Check, which uses the Face API to check driver selfies to make sure that it’s the right person behind the wheel.
Even platforms like Salesforce and products like SendGrid and JIRA let you retrieve information and send events through APIs so that you can build them directly into your own tools. Indeed it’s typical for more than half the traffic on cloud services to go through their APIs rather than their web interface.
In many cases, the services you can now call offer features you wouldn’t have been able to get from a packaged product, and even if you could it would have been hard to bundle them into your own app. If you used Microsoft MapPoint to generate mapping and routing, you could only call that from apps running on your own network, for example. And because these are cloud services, they can keep adding new features while the APIs you use remain stable – although you might want to check for deprecation policies and API lifecycles before you become dependent on a particular service. But if a new credit card becomes popular, Stripe and Braintree can start supporting it without you needing to do extra work. If Twilio switches from one cellular network to another for better connectivity, you won’t need to change the way your app sends and receives SMS.
To get the most from these services, look for functionality that would be hard for you to deliver yourself, and make sure you know what the costs are for different transaction levels.
Some of the APIs are quite specialised. The Academic Knowledge API, for example, can create a graph of citations by year for an author. Others are more broadly applicable. The Recommendations API is a fast way to get suggestions for an ecommerce site, so it can show products that are often bought together, and personalise that list based on what a visitor has bought to recommend what they might specifically like. You can also use it to analyse traffic to see how easy it is to find products on your site.
This works by building a machine learning model from your site catalogue and transactions. Knowing what alternatives other customers buying a product chose could be useful when a buyer calls up to arrange a return, and knowing the patterns of your sales could help you manage inventory more efficiently. Microsoft Dynamics already has some of those tools, but because this is an API, you can use it with whatever CRM or ERP system your business employs, as well as using it to show suggestions on your website.
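The service builds and hosts that model for you, but the underlying ‘frequently bought together’ idea is easy to illustrate locally. This toy co-occurrence counter is a stand-in for the concept only, not the Recommendations API’s actual algorithm, which works at far larger scale:

```python
from collections import Counter

def frequently_bought_together(transactions, product):
    """Count how often other products share a basket with `product`
    and return them most-frequent-first -- a toy version of what the
    Recommendations API learns from your catalogue and transactions."""
    co_counts = Counter()
    for basket in transactions:
        if product in basket:
            co_counts.update(p for p in basket if p != product)
    return [p for p, _ in co_counts.most_common()]

baskets = [
    {"camera", "sd-card", "tripod"},
    {"camera", "sd-card"},
    {"camera", "sd-card", "bag"},
    {"laptop", "mouse"},
]
recs = frequently_bought_together(baskets, "camera")
print(recs)  # "sd-card" ranks first: it shares three baskets with "camera"
```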
The majority of the Cognitive Services APIs deal with services that are more obviously ‘artificial intelligence’ in that they work with language, speech and vision. That’s everything from understanding text and checking spelling to converting speech to text, identifying and authenticating people by their voice, detecting faces (including celebrities) and emotions, to actually understanding the content of an image.
The Language Understanding Intelligence Service (LUIS) looks at text to understand the topic and intent of what someone is asking, so “Tell me about flight delays” gets parsed as a news query for the topic ‘flight delays’. This makes it much easier for developers to map the full range of language that people use when they’re talking or typing onto terms that the app can work with. If you’re writing a bot to take pizza orders, for example, you can expect addresses to show up in a limited number of formats, and having Bing’s expertise behind it means LUIS can understand dates and times, ordinals, numbers, temperatures, distances and proper nouns. But you also want to handle phrases that are more specific to your application: your customers might say “send me a pizza” or “get me a pizza” or “deliver a big pepperoni”, or dozens of other variants.
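Once LUIS has done the parsing, your code simply dispatches on the intent it returns. A sketch assuming a LUIS-style JSON response with a top-scoring intent and a list of entities (the exact field names may differ between service versions):

```python
def route_utterance(luis_response, handlers, threshold=0.5):
    """Dispatch to a handler based on the intent LUIS extracted.
    Low-confidence parses fall through to the "None" handler rather
    than guessing."""
    top = luis_response.get("topScoringIntent", {})
    if top.get("score", 0.0) < threshold:
        return handlers["None"](luis_response)
    return handlers[top["intent"]](luis_response)

# Assumed response shape for "deliver a big pepperoni".
sample = {
    "query": "deliver a big pepperoni",
    "topScoringIntent": {"intent": "OrderPizza", "score": 0.93},
    "entities": [{"entity": "pepperoni", "type": "Topping"},
                 {"entity": "big", "type": "Size"}],
}

handlers = {
    "OrderPizza": lambda r: "ordering: " + ", ".join(
        e["entity"] for e in r["entities"]),
    "None": lambda r: "sorry, I didn't understand that",
}

reply = route_utterance(sample, handlers)
print(reply)  # → "ordering: pepperoni, big"
```

The point of the threshold is the same as LUIS’s own design: “send me a pizza”, “get me a pizza” and “deliver a big pepperoni” all resolve to one intent your code handles once.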
Developers can pair that level of functionality with sentiment analysis and even image recognition. As Galgon puts it, “I might want my bot to be aware of a sentence that is delivered with a strongly positive or strongly negative sentiment, or I might want to let the bot understand an image that’s been sent to it.”
Even traditional features like spell checking get better with machine learning, because language changes. You need the basics, because if someone is typing a question to a bot, you don’t want the fact that they typed ‘hicago’ instead of Chicago to confuse the bot. But this API goes a lot further than traditional spell checking. “Even when you’re looking up static words in a dictionary the challenges are not having the context of the sentence or the paragraph [to make sense of it]. The bigger problem is not adapting over time when new phrases get coined or when a new startup becomes popular; all of a sudden ‘lift’ can be Lyft – which is a valid word now but wasn’t a year or two ago,” Galgon explains. “And the nice thing about making it a web service is that when we have new words and models we update those in the back end and developers get better results for free.”
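A spell-check-style response typically flags tokens by character offset with ranked suggestions, and applying them right-to-left keeps the earlier offsets valid. The field names here (offset/token/suggestions) are an assumption about the response shape, not quoted from the article:

```python
def apply_corrections(text, flagged_tokens):
    """Replace each flagged token with its top-ranked suggestion,
    working from the end of the string backwards so that earlier
    character offsets are not invalidated by the edits."""
    for tok in sorted(flagged_tokens, key=lambda t: t["offset"], reverse=True):
        best = tok["suggestions"][0]["suggestion"]
        start = tok["offset"]
        text = text[:start] + best + text[start + len(tok["token"]):]
    return text

# Assumed response for the article's example typo.
flagged = [{"offset": 10, "token": "hicago",
            "suggestions": [{"suggestion": "Chicago", "score": 0.9}]}]

corrected = apply_corrections("flight to hicago", flagged)
print(corrected)  # → "flight to Chicago"
```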
You can also tweak speech recognition using the Custom Recognition Intelligence Service (CRIS), which helps you build an adaptive audio model for your applications. Speech recognition has mainly been trained with samples from adults working in an office or a conference room. If you want to recognise children or older people, or people who speak English as a second language, you’ll get better results if you use a custom language model. CRIS also lets you build an acoustic model from uploaded samples of audio recorded on location, along with transcriptions, which will help your app cope with running from a kiosk in a shopping centre with lots of background noise, or in the echoing lobby of a large building.
The Cognitive Services APIs are designed to be simple to get started with, and to allow you to get more sophisticated as you gain experience. As Galgon explains, “For the vision APIs, where we offer the capability to describe what’s going on in an image, it’s as simple as sending a photo to the API and we return a response like ‘that shows a man playing in a field and a woman riding a bicycle’.” And you can get more detail if you need it. Send a photo to the facial detection API and it returns age, gender, head pose, smile, facial hair information, the facial bounding box and 27 landmarks for every face in the image. Emotion isn’t just happy or sad; it can recognise anger, contempt, fear, disgust, happiness, neutral, sadness or surprise. So you could stick with the speech APIs as you start building your app, then move to using CRIS when you need the custom models.
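Reading that detail out of a detection response is straightforward. A sketch assuming the Face API’s documented per-face shape (a faceRectangle plus faceAttributes); treat the exact field names as assumptions for whichever API version you call:

```python
def summarize_faces(detect_response):
    """Pull a short summary from a face-detection-style response:
    one entry per face with its bounding box and a few attributes."""
    out = []
    for face in detect_response:
        rect = face["faceRectangle"]
        attrs = face["faceAttributes"]
        out.append({
            "box": (rect["left"], rect["top"],
                    rect["width"], rect["height"]),
            "age": attrs["age"],
            "smiling": attrs["smile"] > 0.5,  # smile comes back as 0..1
        })
    return out

# Assumed single-face response.
sample = [{
    "faceRectangle": {"left": 40, "top": 25, "width": 120, "height": 120},
    "faceAttributes": {"age": 31.5, "gender": "female", "smile": 0.86},
}]

summary = summarize_faces(sample)
print(summary)
```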
CRIS is currently a private preview, although Galgon notes that “we’re letting people in to the preview pretty frequently”, and it will move to public preview soon. More APIs are on the way, and existing APIs get regular improvements, such as adding the caption service for images and increasing the number of categories of objects the vision service can recognise, as well as extending LUIS to understand more languages. There’s also a lot of demand from developers who would like more services that can work with video as well as with still images.
Intelligence for peanuts
You can sign up to Cognitive Services with a free subscription, which you can use even in a commercial app. When you need to make more calls to the API than are covered in the free tier, pricing is based on how much you use the service. “The free tiers cover the vast majority of developers who are building services,” says Galgon. “For the face detection API, for example, that’s 30,000 API calls a month for free, so if all you’re doing is trying to detect faces in an image, that’s 30,000 images that can be sent each month.”
Once you exceed 30,000 calls a month, the face detection API costs $1.50 per 1,000 calls. However, you also need to consider throughput: “The free tiers throttle how often calls can be sent, so a developer might choose to move to the paid tier to get higher throughput for many simultaneous transactions.”
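Client code usually answers throttling with retry-and-backoff before paying for higher throughput. A minimal sketch, with a simulated call standing in for a real API that returns HTTP 429 when a free-tier rate limit is hit:

```python
import time

class Throttled(Exception):
    """Stand-in for the HTTP 429 a throttled free-tier call returns."""

def call_with_backoff(call, retries=3, base_delay=0.01, sleep=time.sleep):
    """Retry a throttled call with exponential backoff, re-raising
    only when the retry budget is exhausted."""
    for attempt in range(retries + 1):
        try:
            return call()
        except Throttled:
            if attempt == retries:
                raise
            sleep(base_delay * (2 ** attempt))  # 1x, 2x, 4x, ...

# Simulated service: throttles the first two calls, then succeeds.
attempts = {"n": 0}
def flaky_call():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise Throttled()
    return {"faces": 1}

result = call_with_backoff(flaky_call)
print(result)  # → {'faces': 1}
```

If backoff alone isn’t enough for your traffic, that is exactly the point at which the paid tier’s higher throughput starts to pay for itself.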
“When you’re getting started, you’re talking cents to dollars in costs, although it depends on your use case,” Galgon explains. “Some apps use five APIs for every action and some apps use only one.”
If you’re creating thumbnails for images and you want to crop into the most interesting area of the photograph automatically rather than just shrinking the whole image, for example, you could combine object recognition, facial detection and OCR and use that to decide what to highlight. If it’s a picture of a person, then you want to keep their face in shot; if there’s text in the image, that’s what you want to show on the thumbnail; if it’s a picture of a bicycle in a street, you want to be able to zoom in to the bike.
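That decision logic is simple once the three responses are merged. A sketch of the priority order described above; the `analysis` dict is an assumed, pre-merged view of the face, OCR and object results, not any real combined payload:

```python
def choose_crop(analysis, image_size):
    """Pick a crop rectangle (left, top, width, height) for a
    thumbnail: a face wins, then a text region, then the most
    confident detected object, else the whole frame."""
    if analysis.get("faces"):
        return analysis["faces"][0]          # keep the face in shot
    if analysis.get("text_regions"):
        return analysis["text_regions"][0]   # show the text
    if analysis.get("objects"):
        best = max(analysis["objects"], key=lambda o: o["confidence"])
        return best["box"]                   # zoom in on the object
    w, h = image_size
    return (0, 0, w, h)                      # fall back to the full image

# Assumed merged results for the article's bicycle-in-a-street example.
street_scene = {
    "faces": [],
    "text_regions": [],
    "objects": [{"label": "bicycle", "confidence": 0.91,
                 "box": (210, 130, 160, 140)}],
}

crop = choose_crop(street_scene, (640, 480))
print(crop)  # → (210, 130, 160, 140)
```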
How clever is clever?
Learning to use Cognitive Services isn’t just about signing up and calling the APIs; it’s also important for developers to understand that what you get with a lot of the services isn’t a ‘yes’ or a ‘no’, but a probability score.
“Instead of saying this face is 100 percent happy with no other emotion, we’re often saying we think it’s happiness with a 73 percent probability but also anger with a 20 percent probability, and there’s a few other emotions mixed in.” But as Galgon points out, that’s what you should expect when dealing with human interactions.
“This is a space where there is not always a cut and dried answer. No-one has yet come up with a fool-proof sarcasm detector – I sometimes struggle myself when I’m reading an email to get the emotion of the person who wrote it. Think about what you might see in an image that you’re looking at, versus me versus your work colleagues…”
The same is true for accuracy: “The computer is not some magic oracle that gets things right 100 percent of the time. If it’s speech to text and it’s incredibly noisy in the background, you or I are going to have trouble understanding it, and so is the computer.” Thinking about ‘how good is a person going to be at this task’ can help set expectations. For identifying a very specific breed of dog, image recognition routinely beats the average person, but if it’s recognising someone’s face in a brightly back-lit room, a human is likely to beat the computer. But as Galgon notes, “these are weaknesses we’re aware of and we’re working to improve the APIs.”
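Handling those probabilistic answers in code means thresholds rather than equality checks. A small sketch: accept the top-scoring emotion only when it clearly beats the runner-up, and admit uncertainty otherwise (the margin value is an arbitrary choice, not anything the service prescribes):

```python
def dominant_emotion(scores, margin=0.2):
    """Return the top-scoring emotion, or None when the runner-up is
    within `margin` -- treating the scores as probabilities rather
    than a hard yes/no answer."""
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    (top, top_p), (_, second_p) = ranked[0], ranked[1]
    if top_p - second_p < margin:
        return None  # too close to call; don't pretend certainty
    return top

# The article's example: happiness 73%, anger 20%, plus some others.
scores = {"happiness": 0.73, "anger": 0.20, "neutral": 0.05, "sadness": 0.02}
print(dominant_emotion(scores))  # → happiness
```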
Usually, developers are surprised by how powerful the Cognitive Services APIs are. “One of the things we see is when someone comes in and starts playing with the APIs it tends to spark their interest. They start out by saying ‘I didn’t think this was possible, now I see it is – and what about this next set of things I want to try and do. This is great, I didn’t think it would even work today – so when can you add your next 10,000 categories for image classification, when are you going to have the next ten languages for LUIS model support?’ They’re saying it’s something they can start using today, but we’re always getting wish lists of the next set of things they’d love to see us do!”