Sample code snippet for extracting font information line by line using the PDFBox API in Java:
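// Imports needed (PDFBox 1.8.x package layout; PDDocument.load(String) and
// TextPosition.getCharacter() are 1.8.x calls).
import java.io.IOException;
import java.util.List;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.util.PDFTextStripper;
import org.apache.pdfbox.util.TextPosition;

public String[] getFontLineByLineFromPdf(String fileName) throws IOException
{
    PDDocument doc = PDDocument.load(fileName);
    try
    {
        PDFTextStripper stripper = new PDFTextStripper() {
            // Base font of the previous character; persists across lines and pages.
            String prevBaseFont = "";

            protected void writeString(String text, List<TextPosition> textPositions) throws IOException
            {
                StringBuilder builder = new StringBuilder();
                for (TextPosition position : textPositions)
                {
                    String baseFont = position.getFont().getBaseFont();
                    // Emit a [FontName] marker only when the font changes.
                    if (baseFont != null && !baseFont.equals(prevBaseFont))
                    {
                        builder.append('[').append(baseFont).append(']');
                        prevBaseFont = baseFont;
                    }
                    builder.append(position.getCharacter());
                }
                writeString(builder.toString());
            }
        };
        String content = stripper.getText(doc);
        // One array element per extracted line, with the font markers inline.
        return content.split("\\r?\\n");
    }
    finally
    {
        doc.close(); // release the document even if extraction fails
    }
}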
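
Because a [FontName] marker is written only when the font changes, a returned line that has no marker is still in the font of the line before it. The following is a minimal sketch, not part of PDFBox, of how the returned array could be split into (line, font, text) runs. The class name FontMarkerParser, its Run type, and the parse method are hypothetical names chosen for illustration. It assumes font names contain no square brackets, and it will mistake literal bracketed text in the PDF for a marker.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: splits lines such as "[Helvetica-Bold]Title" into font runs.
final class FontMarkerParser {
    // Matches one marker, e.g. "[ABCDEF+Times-Roman]"; assumes no brackets inside the name.
    private static final Pattern MARKER = Pattern.compile("\\[([^\\[\\]]+)\\]");

    /** A stretch of text on one line, together with the font in effect for it. */
    static final class Run {
        final int line;     // zero-based index into the input array
        final String font;  // empty string until the first marker is seen
        final String text;

        Run(int line, String font, String text) {
            this.line = line;
            this.font = font;
            this.text = text;
        }
    }

    static List<Run> parse(String[] lines) {
        List<Run> runs = new ArrayList<>();
        String currentFont = ""; // carried across lines, like prevBaseFont in the stripper
        for (int i = 0; i < lines.length; i++) {
            Matcher m = MARKER.matcher(lines[i]);
            int start = 0;
            while (m.find()) {
                // Text before this marker still belongs to the previous font.
                addRun(runs, i, currentFont, lines[i].substring(start, m.start()));
                currentFont = m.group(1);
                start = m.end();
            }
            addRun(runs, i, currentFont, lines[i].substring(start));
        }
        return runs;
    }

    private static void addRun(List<Run> runs, int line, String font, String text) {
        if (!text.isEmpty()) { // skip empty stretches, e.g. a marker at the start of a line
            runs.add(new Run(line, font, text));
        }
    }

    public static void main(String[] args) {
        String[] sample = {"[Helvetica-Bold]Title", "[Times-Roman]Body text", "continues here"};
        for (Run r : parse(sample)) {
            System.out.println(r.line + " " + r.font + " " + r.text);
        }
    }
}
```

The array returned by getFontLineByLineFromPdf can be passed to parse directly. Carrying currentFont across loop iterations mirrors the way prevBaseFont is kept across calls to writeString.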

You can try this free online PDF-to-text converter to convert PDF to text.
Extracting text information from PDF by using PDFtoText Extraction API in .NET