I've been working on predicting English keywords in source code comments with access only to the method's compiled bytecode. I'm using only the comments attached to a whole Java method. I've put together a collection of 330 thousand decompiled Java methods plus their corresponding Javadoc comments from the Debian archive. From there, I trained a machine learning ensemble of classifiers. The ensemble takes a .class file and gives you back, for each method, a set of possible keywords. For example, a method with lots of fadd, fmul, fdiv could be described as a "calculation".
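To make the fadd/fmul/fdiv example concrete, here is a minimal sketch of the underlying idea: count the opcodes in a method's disassembly and map the counts to a keyword. It is not the trained ensemble. The hand-written rule, the 0.25 threshold, and the names opcode_counts and suggest_keywords are all assumptions made for illustration, and the input is assumed to look like the instruction listing printed by `javap -c`.

```python
import re
from collections import Counter

# Matches instruction lines in `javap -c`-style output, e.g. "   2: fmul".
INSTRUCTION = re.compile(r"^\s*\d+:\s+([a-z][a-z0-9_]*)")

# Hypothetical rule (not learned): float arithmetic hints at "calculation".
FLOAT_ARITHMETIC = {"fadd", "fsub", "fmul", "fdiv"}


def opcode_counts(disassembly: str) -> Counter:
    """Count how often each opcode mnemonic appears in one method."""
    counts = Counter()
    for line in disassembly.splitlines():
        match = INSTRUCTION.match(line)
        if match:
            counts[match.group(1)] += 1
    return counts


def suggest_keywords(counts: Counter, threshold: float = 0.25) -> set:
    """Toy stand-in for a classifier: threshold on the arithmetic share."""
    total = sum(counts.values())
    if total == 0:
        return set()
    arithmetic = sum(counts[op] for op in FLOAT_ARITHMETIC)
    return {"calculation"} if arithmetic / total >= threshold else set()


# Hand-written disassembly of a method computing a * b + c on floats.
example = """
       0: fload_1
       1: fload_2
       2: fmul
       3: fload_3
       4: fadd
       5: freturn
"""

counts = opcode_counts(example)
keywords = suggest_keywords(counts)
print(sorted(counts.items()), keywords)
```

A real classifier would learn which opcode patterns go with which keywords from the paired methods and comments, rather than relying on a fixed rule like this one.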
This is really new work. While it is still up to the community to see whether there's value in the technology itself at this stage, I'm making the data and the machine learning scripts available as part of the talk. I'll include enough machine learning background to entice everybody in the audience to try experimenting with the data themselves.