Posted on 2017-01-16 22:44:46
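Unfortunately, the file the OP presented as a representative example is not tagged. There is therefore no direct information indicating whether a given piece of text belongs to the title, the abstract, a reference, or any other part. Accordingly, there is no sure-fire method to recognize these parts, only heuristics, i.e. educated guesses with larger or smaller error rates.
For the sample document presented by the OP, the parts can actually be identified by simply inspecting the font of the first letter of each line.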
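The heuristic can be illustrated without PDFBox. The following is only a minimal sketch under stated assumptions: `Glyph` is a hypothetical stand-in for PDFBox's `TextPosition`, the role names are made up for illustration, and the `ABCDEF+` prefix is an assumed example of the kind of subset prefix embedded fonts often carry. Only the font name fragments (`CMBX12`, `CMR10`, `CMTT9`) come from the sample document.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class FirstGlyphFontHeuristic {
    /** Hypothetical stand-in for PDFBox's TextPosition: one glyph and its font name. */
    static final class Glyph {
        final String text;
        final String fontName;

        Glyph(String text, String fontName) {
            this.text = text;
            this.fontName = fontName;
        }
    }

    // Font name fragment -> guessed role (role names are illustrative only).
    static final Map<String, String> ROLE_BY_FONT = new LinkedHashMap<>();
    static {
        ROLE_BY_FONT.put("CMBX12", "heading");
        ROLE_BY_FONT.put("CMR10", "authors");
        ROLE_BY_FONT.put("CMTT9", "addresses");
    }

    /** Guesses the role of a line from the font of its first glyph only. */
    static String guessRole(List<Glyph> line) {
        if (line.isEmpty()) {
            return "unknown";
        }
        String font = line.get(0).fontName;
        for (Map.Entry<String, String> entry : ROLE_BY_FONT.entrySet()) {
            // contains() rather than equals(): tolerates a subset prefix before the name.
            if (font.contains(entry.getKey())) {
                return entry.getValue();
            }
        }
        return "unknown";
    }

    public static void main(String[] args) {
        List<Glyph> line = Arrays.asList(
                new Glyph("H", "ABCDEF+CMBX12"),
                new Glyph("o", "ABCDEF+CMBX12"));
        System.out.println(guessRole(line)); // prints "heading"
    }
}
```

Because only the first glyph is inspected, a line that merely starts with a differently styled word would be misclassified; that is the kind of error rate a heuristic like this has to accept.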
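The following classes constitute a simple framework for extracting semantic text sections that can be recognized by characteristics of each line, together with a usage example that recognizes the sections in the OP's sample file by inspecting the font of merely the first character of each line.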
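A simple text section extraction framework
As I have only used PDFBox for Java, and the OP stated that a Java solution would be fine too, this framework is implemented in Java. It is based on the current development snapshot of PDFBox version 2.1.0.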
PDFTextSectionStripper
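This class is the center of the framework. It is derived from the PDFBox class PDFTextStripper and extends it by recognizing text sections configured by a list of TextSectionDefinition instances, see below. Once the PDFTextStripper method getText has been called, the recognized sections are available (via getSections) as a list of TextSection instances, see below.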
import java.io.IOException;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;
import org.apache.pdfbox.text.TextPosition;

public class PDFTextSectionStripper extends PDFTextStripper
{
    // constructor
    public PDFTextSectionStripper(List<TextSectionDefinition> sectionDefinitions) throws IOException
    {
        super();
        this.sectionDefinitions = sectionDefinitions;
    }

    // Section retrieval
    /**
     * @return an unmodifiable list of text sections recognized during {@link #getText(PDDocument)}.
     */
    public List<TextSection> getSections()
    {
        return Collections.unmodifiableList(sections);
    }

    // PDFTextStripper overrides
    @Override
    protected void writeLineSeparator() throws IOException
    {
        super.writeLineSeparator();
        if (!currentLine.isEmpty())
        {
            boolean matched = false;
            // A multi-line section is open: check whether the current line continues it.
            if (!(currentHeader.isEmpty() && currentBody.isEmpty()))
            {
                TextSectionDefinition definition = sectionDefinitions.get(currentSectionDefinition);
                switch (definition.multiLine)
                {
                case multiLine:
                    // every line of the section has to match the predicate
                    if (definition.matchPredicate.test(currentLine))
                    {
                        currentBody.add(new ArrayList<>(currentLine));
                        matched = true;
                    }
                    break;
                case multiLineHeader:
                case multiLineIntro:
                    // the section continues until a follow-up definition matches the line
                    boolean followUpMatch = false;
                    for (int i = definition.multiple ? currentSectionDefinition : currentSectionDefinition + 1;
                            i < sectionDefinitions.size(); i++)
                    {
                        TextSectionDefinition followUpDefinition = sectionDefinitions.get(i);
                        if (followUpDefinition.matchPredicate.test(currentLine))
                        {
                            followUpMatch = true;
                            break;
                        }
                    }
                    if (!followUpMatch)
                    {
                        currentBody.add(new ArrayList<>(currentLine));
                        matched = true;
                    }
                    break;
                case singleLine:
                    System.out.println("Internal error: There can be no current header or body as long as the current definition is single line only");
                    break;
                }

                // The open section ends here: store it and reset the buffers.
                if (!matched)
                {
                    sections.add(new TextSection(definition, currentHeader, currentBody));
                    currentHeader.clear();
                    currentBody.clear();
                    if (!definition.multiple)
                        currentSectionDefinition++;
                }
            }

            // No open section took the line: try to start a new section with it.
            if (!matched)
            {
                while (currentSectionDefinition < sectionDefinitions.size())
                {
                    TextSectionDefinition definition = sectionDefinitions.get(currentSectionDefinition);
                    if (definition.matchPredicate.test(currentLine))
                    {
                        matched = true;
                        switch (definition.multiLine)
                        {
                        case singleLine:
                            sections.add(new TextSection(definition, currentLine, Collections.emptyList()));
                            if (!definition.multiple)
                                currentSectionDefinition++;
                            break;
                        case multiLineHeader:
                            currentHeader.addAll(new ArrayList<>(currentLine));
                            break;
                        case multiLine:
                        case multiLineIntro:
                            currentBody.add(new ArrayList<>(currentLine));
                            break;
                        }
                        break;
                    }
                    currentSectionDefinition++;
                }
            }

            if (!matched)
            {
                System.out.println("Could not match line.");
            }
        }
        currentLine.clear();
    }

    @Override
    protected void endDocument(PDDocument document) throws IOException
    {
        super.endDocument(document);
        // flush a section that is still open at the end of the document
        if (!(currentHeader.isEmpty() && currentBody.isEmpty()))
        {
            TextSectionDefinition definition = sectionDefinitions.get(currentSectionDefinition);
            sections.add(new TextSection(definition, currentHeader, currentBody));
            currentHeader.clear();
            currentBody.clear();
        }
    }

    @Override
    protected void writeString(String text, List<TextPosition> textPositions) throws IOException
    {
        super.writeString(text, textPositions);
        // collect the chunks of the current line together with their glyph details
        currentLine.add(textPositions);
    }

    // member variables
    final List<TextSectionDefinition> sectionDefinitions;
    int currentSectionDefinition = 0;
    final List<TextSection> sections = new ArrayList<>();
    final List<List<TextPosition>> currentLine = new ArrayList<>();
    final List<List<TextPosition>> currentHeader = new ArrayList<>();
    final List<List<List<TextPosition>>> currentBody = new ArrayList<>();
}
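The control flow of writeLineSeparator is easier to follow on plain data. The following is a simplified, dependency-free sketch of the same idea, not the class itself: `Line`, `Def`, `Kind`, `byFont` and `group` are hypothetical names, a line is reduced to the font of its first glyph plus its text, header and body are merged into one string, and unmatched lines are dropped silently instead of being reported.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;

class SectionGrouperSketch {
    enum Kind { SINGLE_LINE, MULTI_LINE, MULTI_LINE_HEADER, MULTI_LINE_INTRO }

    /** Hypothetical stand-in for a line of TextPosition objects. */
    static final class Line {
        final String font, text;
        Line(String font, String text) { this.font = font; this.text = text; }
    }

    static final class Def {
        final String name; final Predicate<Line> matches; final Kind kind; final boolean multiple;
        Def(String name, Predicate<Line> matches, Kind kind, boolean multiple) {
            this.name = name; this.matches = matches; this.kind = kind; this.multiple = multiple;
        }
    }

    static Def byFont(String name, String fontFragment, Kind kind, boolean multiple) {
        return new Def(name, line -> line.font.contains(fontFragment), kind, multiple);
    }

    static List<String> group(List<Def> defs, List<Line> lines) {
        List<String> out = new ArrayList<>();
        int current = 0;            // index of the definition currently expected
        StringBuilder open = null;  // multi-line section being collected, if any
        for (Line line : lines) {
            boolean matched = false;
            if (open != null) {
                Def def = defs.get(current);
                if (def.kind == Kind.MULTI_LINE) {
                    matched = def.matches.test(line); // every line has to match
                } else {
                    // Header/intro sections run until a follow-up definition matches.
                    boolean followUp = false;
                    for (int i = def.multiple ? current : current + 1; !followUp && i < defs.size(); i++) {
                        followUp = defs.get(i).matches.test(line);
                    }
                    matched = !followUp;
                }
                if (matched) {
                    open.append(" / ").append(line.text);
                } else {
                    out.add(open.toString()); // close the open section
                    open = null;
                    if (!def.multiple) current++;
                }
            }
            if (!matched) {
                // Try to start a new section, skipping definitions that do not match.
                for (; current < defs.size(); current++) {
                    Def def = defs.get(current);
                    if (!def.matches.test(line)) continue;
                    if (def.kind == Kind.SINGLE_LINE) {
                        out.add(def.name + ": " + line.text);
                        if (!def.multiple) current++;
                    } else {
                        open = new StringBuilder(def.name + ": " + line.text);
                    }
                    break;
                }
            }
        }
        if (open != null) out.add(open.toString()); // flush at end of input
        return out;
    }

    public static void main(String[] args) {
        List<Def> defs = Arrays.asList(
                byFont("Title", "CMBX12", Kind.SINGLE_LINE, false),
                byFont("Abstract", "CMBX9", Kind.MULTI_LINE_INTRO, false),
                byFont("Section", "CMBX12", Kind.MULTI_LINE_HEADER, true));
        List<Line> lines = Arrays.asList(
                new Line("CMBX12", "A Paper"),
                new Line("CMBX9", "Abstract. First line"),
                new Line("CMR9", "second line"),
                new Line("CMBX12", "1 Introduction"),
                new Line("CMR10", "Body text"),
                new Line("CMBX12", "2 Summary"),
                new Line("CMR10", "More text"));
        group(defs, lines).forEach(System.out::println);
    }
}
```

Note that the title and the section headings share the same font test here. They are told apart only by position: the non-repeatable title definition is consumed by the first matching line, so later lines in that font fall through to the repeatable section definition. The order of the definitions therefore matters.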
TextSectionDefinition
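This class specifies the properties of a text section type: a name, a match predicate, a MultiLine attribute, and a multiple occurrence flag.
The name is purely descriptive.
The match predicate is a function that receives detailed information on the characters of a text line and returns whether that line matches the text section type in question.
The MultiLine attribute takes one of four values:
singleLine - for a section consisting of a single line only;
multiLine - for a multi-line section in which every line has to match the predicate;
multiLineHeader - for a multi-line section in which only the first line has to match the predicate; this first line is a header line;
multiLineIntro - for a multi-line section in which only the first line has to match the predicate; this first line is a regular part of the section that may merely be introduced by a special marker word.
The multiple occurrence flag indicates whether there may be multiple instances of this type of text section.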
import java.util.List;
import java.util.function.Predicate;

import org.apache.pdfbox.text.TextPosition;

public class TextSectionDefinition
{
    public enum MultiLine
    {
        singleLine,         // A single line without text body, e.g. title
        multiLine,          // Multiple lines, all match predicate, e.g. emails
        multiLineHeader,    // Multiple lines, first line matches as header, e.g. h1
        multiLineIntro      // Multiple lines, first line matches inline, e.g. abstract
    }

    public TextSectionDefinition(String name, Predicate<List<List<TextPosition>>> matchPredicate, MultiLine multiLine, boolean multiple)
    {
        this.name = name;
        this.matchPredicate = matchPredicate;
        this.multiLine = multiLine;
        this.multiple = multiple;
    }

    final String name;
    final Predicate<List<List<TextPosition>>> matchPredicate;
    final MultiLine multiLine;
    final boolean multiple;
}
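A match predicate receives a whole line as a list of chunks, each of which is a list of glyphs. A predicate that only looks at the very first glyph would fail with an exception if a line or its first chunk were ever empty. The following sketch shows one way to guard against that. The helper `firstElementMatches` is hypothetical and not part of the framework; it is generic so that it can be tried out here with strings instead of `TextPosition` objects.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.function.Predicate;

class LinePredicates {
    /**
     * Builds a line predicate that tests only the first element of the first chunk
     * and returns false (instead of throwing) for an empty line or an empty first chunk.
     */
    static <T> Predicate<List<List<T>>> firstElementMatches(Predicate<? super T> test) {
        return line -> !line.isEmpty()
                && !line.get(0).isEmpty()
                && test.test(line.get(0).get(0));
    }

    public static void main(String[] args) {
        // Strings stand in for glyphs here; with PDFBox the element type would be TextPosition.
        Predicate<List<List<String>>> startsWithA =
                LinePredicates.<String>firstElementMatches(s -> s.startsWith("A"));

        List<List<String>> line = Arrays.asList(Arrays.asList("Ab", "c"), Arrays.asList("d"));
        System.out.println(startsWithA.test(line));                                       // true
        System.out.println(startsWithA.test(Collections.<List<String>>emptyList()));      // false
    }
}
```

With PDFBox on the classpath, a font test should then read roughly `firstElementMatches(tp -> tp.getFont().getName().contains("CMBX12"))`, passed wherever a `Predicate<List<List<TextPosition>>>` is expected.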
TextSection
This class represents a text section recognized by this framework.
import java.text.Normalizer;
import java.text.Normalizer.Form;
import java.util.ArrayList;
import java.util.List;

import org.apache.pdfbox.text.TextPosition;

public class TextSection
{
    public TextSection(TextSectionDefinition definition, List<List<TextPosition>> header, List<List<List<TextPosition>>> body)
    {
        this.definition = definition;
        // copy the lists, the stripper clears its buffers afterwards
        this.header = new ArrayList<>(header);
        this.body = new ArrayList<>(body);
    }

    @Override
    public String toString()
    {
        StringBuilder stringBuilder = new StringBuilder();
        stringBuilder.append(definition.name).append(": ");
        if (!header.isEmpty())
            stringBuilder.append(toString(header));
        stringBuilder.append('\n');
        for (List<List<TextPosition>> bodyLine : body)
        {
            stringBuilder.append(" ").append(toString(bodyLine)).append('\n');
        }
        return stringBuilder.toString();
    }

    String toString(List<List<TextPosition>> words)
    {
        StringBuilder stringBuilder = new StringBuilder();
        boolean first = true;
        for (List<TextPosition> word : words)
        {
            // separate the words by single spaces, no leading space
            if (first)
                first = false;
            else
                stringBuilder.append(' ');
            for (TextPosition textPosition : word)
            {
                stringBuilder.append(textPosition.getUnicode());
            }
        }
        // cf. https://stackoverflow.com/a/7171932/1729265
        return Normalizer.normalize(stringBuilder, Form.NFKC);
    }

    final TextSectionDefinition definition;
    final List<List<TextPosition>> header;
    final List<List<List<TextPosition>>> body;
}
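Concerning the Normalizer.normalize(stringBuilder, Form.NFKC) call, cf. this answer (https://stackoverflow.com/a/7171932/1729265) to the Stack Overflow question "Separating Unicode ligature characters".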
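The effect of that normalization can be seen in isolation. The following minimal sketch uses only the JDK; the class and method names are made up for illustration. It shows how NFKC replaces ligature code points, which text extracted from PDFs may contain, with their plain letters.

```java
import java.text.Normalizer;

class LigatureDemo {
    /** Applies Unicode compatibility normalization (NFKC), which expands ligatures. */
    static String expandLigatures(CharSequence text) {
        return Normalizer.normalize(text, Normalizer.Form.NFKC);
    }

    public static void main(String[] args) {
        // U+FB01 is the "fi" ligature, U+FB02 the "fl" ligature.
        String extracted = "\uFB01nd \uFB02ow";
        System.out.println(expandLigatures(extracted)); // prints "find flow"
    }
}
```

NFKC is not limited to ligatures: it also folds other compatibility characters, e.g. a superscript two becomes a plain digit 2, so it may not be suitable where such distinctions need to be kept.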
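Example use
One can use this framework with very simple match predicates to recognize the sections in the representative example provided by the OP: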
// MultiLine is the nested enum TextSectionDefinition.MultiLine
List<TextSectionDefinition> sectionDefinitions = Arrays.asList(
        new TextSectionDefinition("Titel", x->x.get(0).get(0).getFont().getName().contains("CMBX12"), MultiLine.singleLine, false),
        new TextSectionDefinition("Authors", x->x.get(0).get(0).getFont().getName().contains("CMR10"), MultiLine.multiLine, false),
        new TextSectionDefinition("Institutions", x->x.get(0).get(0).getFont().getName().contains("CMR9"), MultiLine.multiLine, false),
        new TextSectionDefinition("Addresses", x->x.get(0).get(0).getFont().getName().contains("CMTT9"), MultiLine.multiLine, false),
        new TextSectionDefinition("Abstract", x->x.get(0).get(0).getFont().getName().contains("CMBX9"), MultiLine.multiLineIntro, false),
        new TextSectionDefinition("Section", x->x.get(0).get(0).getFont().getName().contains("CMBX12"), MultiLine.multiLineHeader, true)
);

// resource (the sample PDF) and RESULT_FOLDER are defined in the linked test class
try (PDDocument document = PDDocument.load(resource))
{
    PDFTextSectionStripper stripper = new PDFTextSectionStripper(sectionDefinitions);
    stripper.getText(document);

    System.out.println("Sections:");
    List<String> texts = new ArrayList<>();
    for (TextSection textSection : stripper.getSections())
    {
        String text = textSection.toString();
        System.out.println(text);
        texts.add(text);
    }
    Files.write(new File(RESULT_FOLDER, "Wang05a.txt").toPath(), texts);
}
(ExtractTextSections.java, test method testWang05a: https://github.com/mkl-public/testarea-pdfbox2/blob/master/src/test/java/mkl/testarea/pdfbox2/extract/ExtractTextSections.java#L44)
The shortened result:
Titel: How to Break MD5 and Other Hash Functions
Authors:
Xiaoyun Wang and Hongbo Yu
Institutions:
Shandong University, Jinan 250100, China,
Addresses:
xywang@sdu.edu.cn, yhb@mail.sdu.edu.cn
Abstract:
Abstract. MD5 is one of the most widely used cryptographic hash func-
tions nowadays. It was designed in 1992 as an improvement of MD4, and
Section: 1 Introduction
People know that digital signatures are very important in information security.
The security of digital signatures depends on the cryptographic strength of the
Section: 2 Description of MD5
In order to conveniently describe the general structure of MD5, we first recall
the iteration process for hash functions.
Section: 3 Differential Attack for Hash Functions
3.1 The Modular Differential and the XOR Differential
The most important analysis method for hash functions is differential attack
Section: 4 Differential Attack on MD5
4.1 Notation
Before presenting our attack, we first introduce some notation to simplify the
Section: 5 Summary
In this paper we described a powerful attack against hash functions, and in
particular showed that finding a collision of MD5 is easily feasible.
Section: Acknowledgements
It is a pleasure to acknowledge Dengguo Feng for the conversations that led to
this research on MD5. We would like to thank Eli Biham, Andrew C. Yao, and