hadoop - In Hadoop MapReduce, should Spilled Records always equal Map input records or Map output records?

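I am working with the matrix multiplication example that uses MapReduce in Hadoop. I would like to know whether Spilled Records should always equal Map input records and Map output records. The Spilled Records count I get differs from both.

Here is the output of one test run: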


Three by three test
 IB = 1
 KB = 2
 JB = 1
11/12/14 13:16:22 INFO input.FileInputFormat: Total input paths to process : 2
11/12/14 13:16:22 INFO mapred.JobClient: Running job: job_201112141153_0003
11/12/14 13:16:23 INFO mapred.JobClient: map 0% reduce 0%
11/12/14 13:16:32 INFO mapred.JobClient: map 100% reduce 0%
11/12/14 13:16:44 INFO mapred.JobClient: map 100% reduce 100%
11/12/14 13:16:46 INFO mapred.JobClient: Job complete: job_201112141153_0003
11/12/14 13:16:46 INFO mapred.JobClient: Counters: 17
11/12/14 13:16:46 INFO mapred.JobClient: Job Counters
11/12/14 13:16:46 INFO mapred.JobClient: Launched reduce tasks=1
11/12/14 13:16:46 INFO mapred.JobClient: Launched map tasks=2
11/12/14 13:16:46 INFO mapred.JobClient: Data-local map tasks=2
11/12/14 13:16:46 INFO mapred.JobClient: FileSystemCounters
11/12/14 13:16:46 INFO mapred.JobClient: FILE_BYTES_READ=1464
11/12/14 13:16:46 INFO mapred.JobClient: HDFS_BYTES_READ=528
11/12/14 13:16:46 INFO mapred.JobClient: FILE_BYTES_WRITTEN=2998
11/12/14 13:16:46 INFO mapred.JobClient: HDFS_BYTES_WRITTEN=384
11/12/14 13:16:46 INFO mapred.JobClient: Map-Reduce Framework
11/12/14 13:16:46 INFO mapred.JobClient: Reduce input groups=36
11/12/14 13:16:46 INFO mapred.JobClient: Combine output records=0
11/12/14 13:16:46 INFO mapred.JobClient: Map input records=18
11/12/14 13:16:46 INFO mapred.JobClient: Reduce shuffle bytes=735
11/12/14 13:16:46 INFO mapred.JobClient: Reduce output records=15
11/12/14 13:16:46 INFO mapred.JobClient: Spilled Records=108
11/12/14 13:16:46 INFO mapred.JobClient: Map output bytes=1350
11/12/14 13:16:46 INFO mapred.JobClient: Combine input records=0
11/12/14 13:16:46 INFO mapred.JobClient: Map output records=54
11/12/14 13:16:46 INFO mapred.JobClient: Reduce input records=54
11/12/14 13:16:46 INFO jvm.JvmMetrics: Initializing JVM Metrics with processName=JobTracker, sessionId=
11/12/14 13:16:46 INFO input.FileInputFormat: Total input paths to process : 1
11/12/14 13:16:46 INFO mapred.JobClient: Running job: job_local_0001
11/12/14 13:16:46 INFO input.FileInputFormat: Total input paths to process : 1
11/12/14 13:16:46 INFO mapred.MapTask: io.sort.mb = 100
11/12/14 13:16:46 INFO mapred.MapTask: data buffer = 79691776/99614720
11/12/14 13:16:46 INFO mapred.MapTask: record buffer = 262144/327680
11/12/14 13:16:46 INFO mapred.MapTask: Starting flush of map output
11/12/14 13:16:46 INFO mapred.MapTask: Finished spill 0
11/12/14 13:16:46 INFO mapred.TaskRunner: Task:attempt_local_0001_m_000000_0 is done. And is in the process of commiting
11/12/14 13:16:46 INFO mapred.LocalJobRunner:
11/12/14 13:16:46 INFO mapred.TaskRunner: Task 'attempt_local_0001_m_000000_0' done.
11/12/14 13:16:46 INFO mapred.LocalJobRunner:
11/12/14 13:16:46 INFO mapred.Merger: Merging 1 sorted segments
11/12/14 13:16:46 INFO mapred.Merger: Down to the last merge-pass, with 1 segments left of total size: 128 bytes
11/12/14 13:16:46 INFO mapred.LocalJobRunner:
11/12/14 13:16:46 INFO mapred.TaskRunner: Task:attempt_local_0001_r_000000_0 is done. And is in the process of commiting
11/12/14 13:16:46 INFO mapred.LocalJobRunner:
11/12/14 13:16:46 INFO mapred.TaskRunner: Task attempt_local_0001_r_000000_0 is allowed to commit now
11/12/14 13:16:46 INFO output.FileOutputCommitter: Saved output of task 'attempt_local_0001_r_000000_0' to hdfs://localhost:9000/tmp/MatrixMultiply/out
11/12/14 13:16:46 INFO mapred.LocalJobRunner: reduce> reduce
11/12/14 13:16:46 INFO mapred.TaskRunner: Task 'attempt_local_0001_r_000000_0' done.
11/12/14 13:16:47 INFO mapred.JobClient: map 100% reduce 100%
11/12/14 13:16:47 INFO mapred.JobClient: Job complete: job_local_0001
11/12/14 13:16:47 INFO mapred.JobClient: Counters: 14
11/12/14 13:16:47 INFO mapred.JobClient: FileSystemCounters
11/12/14 13:16:47 INFO mapred.JobClient: FILE_BYTES_READ=89412
11/12/14 13:16:47 INFO mapred.JobClient: HDFS_BYTES_READ=37206
11/12/14 13:16:47 INFO mapred.JobClient: FILE_BYTES_WRITTEN=37390
11/12/14 13:16:47 INFO mapred.JobClient: HDFS_BYTES_WRITTEN=164756
11/12/14 13:16:47 INFO mapred.JobClient: Map-Reduce Framework
11/12/14 13:16:47 INFO mapred.JobClient: Reduce input groups=9
11/12/14 13:16:47 INFO mapred.JobClient: Combine output records=9
11/12/14 13:16:47 INFO mapred.JobClient: Map input records=15
11/12/14 13:16:47 INFO mapred.JobClient: Reduce shuffle bytes=0
11/12/14 13:16:47 INFO mapred.JobClient: Reduce output records=9
11/12/14 13:16:47 INFO mapred.JobClient: Spilled Records=18
11/12/14 13:16:47 INFO mapred.JobClient: Map output bytes=180
11/12/14 13:16:47 INFO mapred.JobClient: Combine input records=15
11/12/14 13:16:47 INFO mapred.JobClient: Map output records=15
11/12/14 13:16:47 INFO mapred.JobClient: Reduce input records=9
...........X[0][0]=30, Y[0][0]=9
Bad Answer
...........X[0][1]=36, Y[0][1]=36
...........X[0][2]=42, Y[0][2]=42
...........X[1][0]=66, Y[1][0]=24
Bad Answer
...........X[1][1]=81, Y[1][1]=81
...........X[1][2]=96, Y[1][2]=96
...........X[2][0]=102, Y[2][0]=39
Bad Answer
...........X[2][1]=126, Y[2][1]=126
...........X[2][2]=150, Y[2][2]=150 

The example, together with its code, is described here:

http://www.norstad.org/matrix-multiply/index.html

Can you tell me where the problem is and how to get the correct result? Thanks.

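Hadoop defines "Spilled Records" as the total number of records spilled to disk over the course of a job, including both map-side and reduce-side spills. It is possible for the "Spilled Records" count to be zero, which is ideal. Typically, spilled records mean that you have exceeded the amount of memory available in the map output buffer. Having a small number of "Spilled Records" is usually not a problem.

The settings that control the available RAM are io.sort.mb and io.sort.spill.percent in your mapred-site.xml. If performance is an issue, you will want to tune them to minimize spilled records. The presentation Optimizing MapReduce Job Performance has more details, especially slides #12 and #13. If you spill more than once, there is a 3x I/O penalty because the spills have to be merged. If "Spilled Records" exceeds "Map output records", you are spilling more than once. Ultimately, the amount of RAM is limited by the Java VM heap size, so you may need to scale up the cluster in size or in number.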
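
To make the role of io.sort.mb and io.sort.spill.percent concrete, here is a minimal Python sketch of the arithmetic that appears to lie behind the "data buffer" and "record buffer" lines in the log above. It rests on assumptions the post does not state: a default io.sort.spill.percent of 0.80, a default io.sort.record.percent of 0.05, and 16 bytes of bookkeeping per record. The function and constant names are made up for illustration.

```python
# Sketch: reproduce the buffer limits printed by MapTask from io.sort.mb.
# ASSUMPTIONS (not stated in the post): spill_percent=0.80,
# record_percent=0.05, and 16 bytes of accounting data per record.

RECORD_ACCOUNTING_BYTES = 16  # assumed bookkeeping cost per buffered record


def sort_buffer_limits(io_sort_mb, spill_percent=0.80, record_percent=0.05):
    """Return capacities and spill thresholds for the map output buffer."""
    total_bytes = io_sort_mb * 1024 * 1024

    # A slice of the buffer is reserved for per-record accounting data.
    accounting_bytes = int(total_bytes * record_percent)
    accounting_bytes -= accounting_bytes % RECORD_ACCOUNTING_BYTES

    data_capacity = total_bytes - accounting_bytes
    record_capacity = accounting_bytes // RECORD_ACCOUNTING_BYTES

    # A spill starts once either region passes spill_percent of its capacity.
    return {
        "data_spill_threshold": int(data_capacity * spill_percent),
        "data_capacity": data_capacity,
        "record_spill_threshold": int(record_capacity * spill_percent),
        "record_capacity": record_capacity,
    }


if __name__ == "__main__":
    limits = sort_buffer_limits(100)
    print("data buffer = {data_spill_threshold}/{data_capacity}".format(**limits))
    print("record buffer = {record_spill_threshold}/{record_capacity}".format(**limits))
```

With io.sort.mb = 100 this yields 79691776/99614720 for the data buffer and 262144/327680 for the record buffer, the same figures the log reports. Under these assumptions a map task whose output stays below both thresholds is written to disk only once, in the final flush.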

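In your specific example, "Spilled Records" (108) is exactly twice "Map output records" (54) for the first job. Because the counter includes both map-side and reduce-side spills, that is consistent with each record being written to disk once on each side, so your map tasks are most likely not spilling more than once.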
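
The comparison of these counters can also be scripted. The Python sketch below parses JobClient counter lines in the format shown in the question and compares "Spilled Records" with a rough single-pass estimate. The estimate is an assumption rather than a documented Hadoop formula: it supposes that every record leaving the map side (equal in total to "Reduce input records") is written to disk once by the map tasks and once more by the reducers. The function names are illustrative.

```python
import re

# Counter lines as printed by JobClient in the post, for example:
# "11/12/14 13:16:46 INFO mapred.JobClient: Spilled Records=108"
COUNTER_LINE = re.compile(r"mapred\.JobClient:\s*([^=]+?)\s*=\s*(\d+)\s*$")

# Three counter lines copied from the first job in the question.
FIRST_JOB_LOG = """\
11/12/14 13:16:46 INFO mapred.JobClient: Spilled Records=108
11/12/14 13:16:46 INFO mapred.JobClient: Map output records=54
11/12/14 13:16:46 INFO mapred.JobClient: Reduce input records=54
"""


def parse_counters(log_text):
    """Collect 'name=value' counters from JobClient log lines into a dict."""
    counters = {}
    for line in log_text.splitlines():
        match = COUNTER_LINE.search(line)
        if match:
            counters[match.group(1)] = int(match.group(2))
    return counters


def spill_summary(counters):
    """Compare Spilled Records with an ASSUMED one-pass-per-side estimate."""
    spilled = counters["Spilled Records"]
    reduce_input = counters["Reduce input records"]
    # Assumption: one write on the map side plus one on the reduce side.
    single_pass_estimate = 2 * reduce_input
    return {
        "spilled": spilled,
        "single_pass_estimate": single_pass_estimate,
        "more_than_one_pass": spilled > single_pass_estimate,
    }


if __name__ == "__main__":
    print(spill_summary(parse_counters(FIRST_JOB_LOG)))
```

For the first job the estimate is 2 × 54 = 108, which equals the reported count. The second job fits the same pattern: its combiner reduces 15 map output records to 9, and 2 × 9 = 18 matches its "Spilled Records" value.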
