Recognizing company and individual names – part 1

I recently talked about the possibility of using machine learning to automatically discover themes, targets and relationships from intelligence reporting.

This has a number of interesting applications. In this particular instance, we were extracting names and topics to aggregate click statistics, but the same technique could also be used to recommend related reports to analysts or to visualize relationships between targets, among other things.

The most basic approach is to automatically recognize the names of people, places and organizations (“named entities”) in a piece of text. Consider the following (extremely basic) intelligence report.

If we had to tag this report with a particular target or topic, you and I could easily pick out “John Smith” and/or “Bank of Australia” as important subjects.

A computer, on the other hand, only sees a stream of letters and symbols. It has no inherent concept of which are important and which are not. If we want a computer to be able to recognize names, we will need to give it a set of rules to follow (i.e. some kind of computer program).

We could do this by just listing out every possible name we can think of, and telling the computer to flag any sequence of symbols that matches an entry in the list.

This could work, but it would be very time-consuming and prone to error. It’s difficult, if not impossible, to compile an exhaustive list of every name currently in use in English (let alone other languages). That’s without even considering misspellings, new names or spelling variations that might arise in future.
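To make the brittleness concrete, here is a minimal sketch of the list-based approach. The name list, the function name and the sample sentence are all hypothetical, made up purely for illustration.

```python
# Hypothetical name list: in practice this would have to be enormous.
KNOWN_NAMES = ["John Smith", "Bank of Australia"]


def find_listed_names(text, names=KNOWN_NAMES):
    """Return (name, start_offset) for every exact occurrence of a listed name."""
    hits = []
    for name in names:
        start = text.find(name)
        while start != -1:
            hits.append((name, start))
            start = text.find(name, start + 1)
    # Report hits in the order they appear in the text.
    return sorted(hits, key=lambda hit: hit[1])


# Made-up sentence containing a misspelling of "John Smith".
sample = "We believe Jon Smith visited the Bank of Australia."
print(find_listed_names(sample))
```

Because the match is exact, the misspelled “Jon Smith” slips through unless that variant is also added to the list by hand.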

Ideally, we would build a set of more general, flexible rules to estimate how likely a sequence of symbols is to be a named entity.

A basic starting rule, for example, might be to mark any capitalized word as a named entity.

Good starting point, but this isn’t exactly correct, as “John Smith” should be marked as one name, rather than two separate names (“John” and “Smith”). This also incorrectly marks the first word of every sentence as a name.
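As a rough sketch, the capitalized-word rule could be written as a single regular expression. The function name and sample sentence below are hypothetical, and the pattern only handles plain ASCII letters.

```python
import re


def capitalized_words(text):
    """Flag every word that starts with an uppercase ASCII letter."""
    return re.findall(r"\b[A-Z][a-z]*\b", text)


# Made-up sentence for illustration.
sample = "We believe John Smith visited the Bank of Australia in Sydney."
print(capitalized_words(sample))
# ['We', 'John', 'Smith', 'Bank', 'Australia', 'Sydney']
```

Both problems show up here: “John” and “Smith” come back as separate names, and the sentence-initial “We” is flagged even though it is not a name.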

OK, let’s refine this by adding another rule: mark any two successive capitalized words as a single named entity.

Better, but this now misses single-word names, as well as names whose capitalized words are separated by a preposition (such as the “of” in “Bank of Australia”). We would need to refine it even further.
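The two-word rule can be sketched in the same way. Again, the function name and the sample sentence are hypothetical, and the pattern assumes the two words are separated by a single space.

```python
import re


def capitalized_pairs(text):
    """Flag any two successive capitalized words as one candidate name."""
    return re.findall(r"\b[A-Z][a-z]+ [A-Z][a-z]+\b", text)


# Made-up sentence for illustration.
sample = "We believe John Smith visited the Bank of Australia in Sydney."
print(capitalized_pairs(sample))
# ['John Smith']
```

“John Smith” is now picked up as a single name, but “Bank of Australia” and “Sydney” are both lost. Patching each gap means another rule, and each new rule tends to bring its own exceptions.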

You’re probably beginning to appreciate that hand-crafted rules like this end up being a complicated nest of exceptions, modifications and “but-ifs”.

If possible, we’d prefer to use machine learning to take a bunch of pre-labelled data, and automatically generate these kinds of rules.
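What “pre-labelled data” looks like depends on the tool. One common convention, assumed here for illustration rather than taken from this project, is to tag each word as the beginning of an entity (B-), inside an entity (I-) or outside any entity (O). The sentence, tag names and helper function below are hypothetical.

```python
# Hypothetical hand-labelled sentence: one (token, tag) pair per word.
labelled = [
    ("We", "O"), ("believe", "O"),
    ("John", "B-PER"), ("Smith", "I-PER"),
    ("visited", "O"), ("the", "O"),
    ("Bank", "B-ORG"), ("of", "I-ORG"), ("Australia", "I-ORG"),
]


def entities_from_tags(pairs):
    """Group consecutive B-/I- tagged tokens back into (text, type) entities."""
    entities, current, current_type = [], [], None
    for token, tag in pairs:
        if tag.startswith("B-"):
            # A new entity starts; close any entity still open.
            if current:
                entities.append((" ".join(current), current_type))
            current, current_type = [token], tag[2:]
        elif tag.startswith("I-") and current and tag[2:] == current_type:
            current.append(token)
        else:
            # An "O" tag (or a stray I- tag) ends the open entity.
            if current:
                entities.append((" ".join(current), current_type))
            current, current_type = [], None
    if current:
        entities.append((" ".join(current), current_type))
    return entities


print(entities_from_tags(labelled))
# [('John Smith', 'PER'), ('Bank of Australia', 'ORG')]
```

A learning algorithm would take many such labelled sentences as input and work out for itself which patterns of words tend to carry which tags.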

There are a number of ways to do this, but I’ll just focus on one, known as the Conditional Random Field (“CRF”) model.

In Part 2, I’ll go through the CRF model to see how it could be used to identify named entities within a sequence of words.