FST (Finite-state transducers) Libraries, C++ or java

Collectives™ on Stack Overflow

Find centralized, trusted content and collaborate around the technologies you use most.

Learn more about Collectives

Teams

Q&A for work

Connect and share knowledge within a single location that is structured and easy to search.

Learn more about Teams

I have a problem to solve using FSTs. Basically, I'll make a morphological parser, and in this moment i have to work with large transducers. The performance is The Big issue here.

Recently, i worked in c++ in other projects where the performance matters, but now, i'am considering java, because the java's benefits and because java is getting better.

I studied some comparisons between java and c++, but I cannot decide what language i should use for this specific problem because it depends on lib in use.

I can´t find much information about java's libs, so, my question is: Are there any open source java libs in which the performance is good, like The RWTH FSA Toolkit that i read in an article that is the fastest c++ lib?

Thanks all.

What are the "benefits" of Java, for your purposes? What specific problem does that platform solve that you need? What is the performance constraint you must consider? Were the "comparisons" fair, because Java is actually extremely difficult to benchmark. So is C++, but you can at least get some algorithmic boundary guarantees from STL.

I suggest you look at OpenFst and the AT&T finite-state transducer tools. There are others out there, but I think your worry about Java puts the cart before the horse-- focus on what solves your problem well.

Good luck!

The problem here is the minimum size of your objects in Java. In C++, without virtual methods and run time type identification, your objects weight exactly their content. And the time your automata take to manipulate memory has a big impact on performance.

I think that should be the main reason for choosing C++ over Java.

OpenFST is a C++ finite state transducer framework that is really comprehensive. Some people from CMU ported it to Java for use in their natural language processing.

A blog post series describing it .
The code is located on svn .

Update: I ported it to java here

Lucene has a excellent implementation of FST, which is easy to use and high performance, making query engines like Elasticsearch, Solr deliver very fast sub-second term based query.Let me take an example:

import com.google.common.base.Preconditions;
import org.apache.lucene.store.ByteArrayDataInput;
import org.apache.lucene.store.DataInput;
import org.apache.lucene.store.GrowableByteArrayDataOutput;
import org.apache.lucene.util.BytesRef;
import org.apache.lucene.util.IntsRefBuilder;
import org.apache.lucene.util.fst.Builder;
import org.apache.lucene.util.fst.FST;
import org.apache.lucene.util.fst.PositiveIntOutputs;
import org.apache.lucene.util.fst.Util;
import java.io.IOException;
public class T {
    private final String inputValues[] = {"cat", "dog", "dogs"};
    private final long outputValues[] = {5, 7, 12};
    // https://lucene.apache.org/core/8_4_0/core/org/apache/lucene/util/fst/package-summary.html
    public static void main(String[] args) throws IOException {
        T t = new T();
        FST<Long> fst = t.buildFSTInMemory();
        System.out.println(String.format("memory used for fst is %d bytes", fst.ramBytesUsed()));
        t.searchFST(fst);
        byte[] bytes = t.serialize(fst);
        System.out.println(String.format("length of serialized fst is %d bytes", bytes.length));
        fst = t.deserialize(bytes);
        t.searchFST(fst);
    private FST<Long> buildFSTInMemory() throws IOException {
        // Input values (keys). These must be provided to Builder in Unicode sorted order! Use Collections.sort() to sort inputValues first.
        PositiveIntOutputs outputs = PositiveIntOutputs.getSingleton();
        Builder<Long> builder = new Builder<Long>(FST.INPUT_TYPE.BYTE1, outputs);
        BytesRef scratchBytes = new BytesRef();
        IntsRefBuilder scratchInts = new IntsRefBuilder();
        for (int i = 0; i < inputValues.length; i++) {
//            scratchBytes.copyChars(inputValues[i]);
            scratchBytes.bytes = inputValues[i].getBytes();
            scratchBytes.offset = 0;
            scratchBytes.length = inputValues[i].length();
            builder.add(Util.toIntsRef(scratchBytes, scratchInts), outputValues[i]);
        FST<Long> fst = builder.finish();
        return fst;
    private FST<Long> deserialize(byte[] bytes) throws IOException {
        DataInput in = new ByteArrayDataInput(bytes);
        PositiveIntOutputs outputs = PositiveIntOutputs.getSingleton();
        FST<Long> fst = new FST<Long>(in, outputs);
        return fst;
    private byte[] serialize(FST<Long> fst) throws IOException {
        final int capicity = 32;
        GrowableByteArrayDataOutput out = new GrowableByteArrayDataOutput(capicity);
        fst.save(out);
        return out.getBytes();
    private void searchFST(FST<Long> fst) throws IOException {
        for (int i = 0; i < inputValues.length; i++) {
            Long value = Util.get(fst, new BytesRef(inputValues[i]));
            Preconditions.checkState(value == outputValues[i], "fatal error");
        Thanks for contributing an answer to Stack Overflow!
Please be sure to answer the question. Provide details and share your research!
But avoid …
Asking for help, clarification, or responding to other answers.
Making statements based on opinion; back them up with references or personal experience.
To learn more, see our tips on writing great answers.