java - How do I efficiently write a large data structure to a file?


I have a variable of type HashMap<String, HashSet<Long>> that can grow to about 100 MB. I need to write it to secondary storage.

Serialization is not an option because it is too slow for me. Is there a better way to dump the structure's bytes to the hard drive?

PS: I am only concerned about write speed; slow reads are not a problem.


You can serialize it yourself. You can also compress the data to make it smaller.


public static void write(String filename, Map<String, Set<Long>> data) throws IOException {
    try (DataOutputStream dos = new DataOutputStream(new BufferedOutputStream(
            new DeflaterOutputStream(new FileOutputStream(filename))))) {
        dos.writeInt(data.size());
        for (Map.Entry<String, Set<Long>> entry : data.entrySet()) {
            dos.writeUTF(entry.getKey());
            Set<Long> value = entry.getValue();
            dos.writeInt(value.size());
            for (Long l : value) {
                dos.writeLong(l);
            }
        }
    }
}

To read it back, you read the same fields in the same order instead of writing them.


public static Map<String, Set<Long>> read(String filename) throws IOException {
    Map<String, Set<Long>> ret = new LinkedHashMap<>();
    try (DataInputStream dis = new DataInputStream(new BufferedInputStream(
            new InflaterInputStream(new FileInputStream(filename))))) {
        for (int i = 0, size = dis.readInt(); i < size; i++) {
            String key = dis.readUTF();
            Set<Long> values = new LinkedHashSet<>();
            ret.put(key, values);
            for (int j = 0, size2 = dis.readInt(); j < size2; j++)
                values.add(dis.readLong());
        }
    }
    return ret;
}

public static void main(String... ignored) throws IOException {
    Map<String, Set<Long>> map = new LinkedHashMap<>();
    for (int i = 0; i < 20000; i++) {
        Set<Long> set = new LinkedHashSet<>();
        set.add(System.currentTimeMillis());
        map.put("key-" + i, set);
    }
    for (int i = 0; i < 5; i++) {
        long start = System.nanoTime();
        File file = File.createTempFile("delete", "me");
        write(file.getAbsolutePath(), map);
        Map<String, Set<Long>> map2 = read(file.getAbsolutePath());
        if (!map2.equals(map)) {
            throw new AssertionError();
        }
        long time = System.nanoTime() - start;
        System.out.printf("With %,d keys, the file used %.1f KB, took %.1f to write/read ms%n",
                map.size(), file.length() / 1024.0, time / 1e6);
        file.delete();
    }
}

This prints:


With 20,000 keys, the file used 44.1 KB, took 155.2 to write/read ms
With 20,000 keys, the file used 44.1 KB, took 84.9 to write/read ms
With 20,000 keys, the file used 44.1 KB, took 51.6 to write/read ms
With 20,000 keys, the file used 44.1 KB, took 21.4 to write/read ms
With 20,000 keys, the file used 44.1 KB, took 21.6 to write/read ms

At 20K entries, each entry uses only about 2.2 bytes.


Use any suitable serialization library (some of them are fast; Google Protocol Buffers, for example, is fast and produces small messages) to get the data into a suitable form, then compress it in memory and dump the result to disk.

In most cases, disk I/O time will be your main bottleneck, so compression can help reduce it significantly.
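A minimal sketch of the "compress in memory, then dump to disk" idea, reusing the manual encoding from the first answer rather than Protocol Buffers. The class and method names here are illustrative, not from the question:

```java
import java.io.*;
import java.nio.file.*;
import java.util.*;
import java.util.zip.*;

public class CompressThenDump {

    // Encode and deflate the map entirely in memory, returning the compressed bytes.
    static byte[] encodeCompressed(Map<String, Set<Long>> data) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (DataOutputStream dos = new DataOutputStream(new DeflaterOutputStream(buffer))) {
            dos.writeInt(data.size());
            for (Map.Entry<String, Set<Long>> e : data.entrySet()) {
                dos.writeUTF(e.getKey());
                dos.writeInt(e.getValue().size());
                for (long l : e.getValue())
                    dos.writeLong(l);
            }
        }
        return buffer.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        Map<String, Set<Long>> map = new LinkedHashMap<>();
        for (int i = 0; i < 1000; i++)
            map.put("key-" + i, new LinkedHashSet<>(List.of((long) i)));

        byte[] compressed = encodeCompressed(map);
        Path file = Files.createTempFile("map", ".bin");
        Files.write(file, compressed);   // one sequential write to disk

        System.out.println(compressed.length > 0 && compressed.length == Files.size(file));
        Files.delete(file);
    }
}
```

Because the compression happens in memory first, the disk sees a single sequential write of the already-shrunk payload, which is the cheapest access pattern for both spinning disks and SSDs.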

...