Researchers at Cornell University and Tel Aviv University have developed a method that enables a computer program to scan text in any of a number of languages, including English and Chinese, and infer the underlying rules of grammar autonomously, without any prior information.

The rules can then be used to generate new and meaningful sentences. The method also works for such data as sheet music or protein sequences.

The development — which has a patent pending — has implications for speech recognition and for other applications in natural language engineering, as well as for genomics and proteomics. It also offers new insights into language acquisition and psycholinguistics.

“The algorithm — the computational method — for language learning and processing that we have developed can take a body of text, abstract from it a collection of recurring patterns or rules and then generate new material,” explained Shimon Edelman, a computer scientist who is a professor of psychology at Cornell and co-author of a new paper, “Unsupervised Learning of Natural Languages,” published in the Proceedings of the National Academy of Sciences (PNAS, Vol. 102, No. 33).
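The learn-then-generate idea Edelman describes can be illustrated with a toy sketch. This is not the ADIOS algorithm reported in the paper; it is only a minimal byte-pair-style heuristic, with hypothetical function names, showing the general principle of abstracting recurring patterns from raw text and reusing them as rules:

```python
# Toy illustration of unsupervised pattern discovery -- NOT the authors'
# ADIOS method, just a minimal sketch of the underlying idea: find
# recurring patterns in raw sequences and abstract them into rules.
from collections import Counter

def learn_patterns(corpus, n_rules=3):
    """Repeatedly merge the most frequent adjacent word pair into a
    new abstract symbol (a byte-pair-encoding-style heuristic)."""
    sentences = [s.split() for s in corpus]
    rules = {}
    for i in range(n_rules):
        pairs = Counter()
        for sent in sentences:
            for a, b in zip(sent, sent[1:]):
                pairs[(a, b)] += 1
        if not pairs:
            break
        (a, b), count = pairs.most_common(1)[0]
        if count < 2:          # stop when no pattern recurs
            break
        symbol = f"P{i}"       # new abstract symbol for this pattern
        rules[symbol] = (a, b)
        # Rewrite the corpus, replacing the pattern with the symbol.
        rewritten = []
        for sent in sentences:
            out, j = [], 0
            while j < len(sent):
                if j + 1 < len(sent) and (sent[j], sent[j + 1]) == (a, b):
                    out.append(symbol)
                    j += 2
                else:
                    out.append(sent[j])
                    j += 1
            rewritten.append(out)
        sentences = rewritten
    return rules, sentences

def expand(token, rules):
    """Recursively expand an abstract symbol back into words."""
    if token not in rules:
        return [token]
    a, b = rules[token]
    return expand(a, rules) + expand(b, rules)

corpus = [
    "the cat sat on the mat",
    "the dog sat on the rug",
    "the cat ran on the rug",
]
rules, compressed = learn_patterns(corpus)
for sym, (a, b) in rules.items():
    print(sym, "->", a, b)   # e.g. the recurring pattern "on the"
```

Swapping abstract symbols for alternative expansions learned elsewhere in the corpus is what would let such a system generate new, never-seen sentences; the real ADIOS algorithm does this with a far more sophisticated graph-based significance test.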

“This is the first time an unsupervised algorithm is shown capable of learning complex syntax, generating grammatical new sentences and proving useful in other fields that call for structure discovery from raw data, such as bioinformatics,” he said.
