CPython合并排序的性能曲线出乎意料

Question 1

我在Python中实现了一种天真的合并排序算法。算法和测试代码如下。

import time
import random
import matplotlib.pyplot as plt
import math
from collections import deque
def sort(unsorted):
    if len(unsorted) <= 1:
        return unsorted
    to_merge = deque(deque([elem]) for elem in unsorted)
    while len(to_merge) > 1:
        left = to_merge.popleft()
        right = to_merge.popleft()
        to_merge.append(merge(left, right))
    return to_merge.pop()
def merge(left, right):
    result = deque()
    while left or right:
        if left and right:
            elem = left.popleft() if left[0] > right[0] else right.popleft()
        elif not left and right:
            elem = right.popleft()
        elif not right and left:
            elem = left.popleft()
        result.append(elem)
    return result
LOOP_COUNT = 100
START_N = 1
END_N = 1000
def test(fun, test_data):
    start = time.clock()
    for _ in xrange(LOOP_COUNT):
        fun(test_data)
    return  time.clock() - start
def run_test():
    timings, elem_nums = [], []
    test_data = random.sample(xrange(100000), END_N)
    for i in xrange(START_N, END_N):
        loop_test_data = test_data[:i]
        elapsed = test(sort, loop_test_data)
        timings.append(elapsed)
        elem_nums.append(len(loop_test_data))
        print "%f s --- %d elems" % (elapsed, len(loop_test_data))
    plt.plot(elem_nums, timings)
    plt.show()
run_test()
在我看来，一切都很好，我应该得到一个漂亮的N*logN曲线作为结果。但是，图片上的情况却有点不同。
PyPy. The curve is ok. 
Disabled the GC using the gc module. Wrong guess. Debug output showed that it doesn't even run until the end of the test.
Memory profiling using meliae - nothing special or suspicious.
I had another implementation (a recursive one using the same merge function), it acts the similar way. The more full test cycles I create - the more "jumps" there are in the curve.
那么，如何解释这种行为并--希望--修复它呢？
更新：将列表改为collection.deque
UPD2：添加了完整的测试代码
UPD3：我在Ubuntu 11.04操作系统上使用Python 2.7.1，使用一台四核2Hz的笔记本。我试着关掉了大部分其他进程：尖峰的数量减少了，但至少有一个进程还在。

Question 2


          
           你只是在捡拾其他进程对你机器的影响。
          
          
           你对输入大小1运行你的排序函数100次，并记录在这上面花费的总时间。然后你对输入大小为2的函数运行100次，并记录所花费的总时间。你继续这样做，直到你达到输入大小1000。
          
          
           假设你的操作系统（或你自己）偶尔会开始做一些CPU密集型的工作。假设这个 "尖峰 "持续的时间与你运行排序函数5000次的时间一样长。这意味着在5000/100=50个连续的输入大小时，执行时间会显得很慢。过了一会儿，又发生了一个尖峰，另一个范围的输入大小看起来很慢。这正是你在图表中看到的情况。
          
          
           我可以想到一个方法来避免这个问题。对每个输入大小只运行一次你的排序函数：1，2，3，...，1000。重复这个过程100次，使用同样的1000个输入（这很重要，见后面的解释）。现在把
           
            minimum
           
           每个输入尺寸所花费的时间作为图表的最终数据点。
          
          
           这样，在100次运行中，你的尖峰应该只影响每个输入尺寸的几次；而且由于你采取的是最小值，它们可能对最终的图表完全没有影响。
          
          
           如果你的尖峰真的非常长和频繁，你当然可能想增加重复次数，超过目前每个输入大小100次。
          
          
           看一下你的峰值，我注意到在一个峰值期间，执行速度正好慢了3倍。我猜测操作系统在高负荷时给你的python进程提供了三个中的一个插槽。不管我的猜测是否正确，我推荐的方法应该可以解决这个问题。
          
          
           EDIT:
          
          
           我意识到，在我提出的解决你的问题的方案中，我没有澄清一点。
          
          
           在给定的输入规模下，你应该在你的100次运行中每次都使用相同的输入吗？还是应该使用100个不同的（随机）输入？
          
          
           因为我建议取执行时间的最小值，所以输入应该是相同的（否则你会得到不正确的输出，因为你将测量最佳情况下的算法复杂性，而不是平均复杂性！）。
          
          
           但当你采取相同的输入时，你会在图表中产生一些噪音，因为一些输入根本比其他输入快。
          
          
           因此，一个更好的解决方案是解决系统负载问题，而不产生每个输入大小只有一个输入的问题（这显然是伪代码）。
          
          seed = 'choose whatever you like'
repeats = 4
inputs_per_size = 25
runtimes = defaultdict(lambda : float('inf'))
for r in range(repeats):
  random.seed(seed)
  for i in range(inputs_per_size):
    for n in range(1000):
      input = generate_random_input(size = n)
      execution_time = get_execution_time(input)
      if runtimes[(n, i)] > execution_time:
        runtimes[(n,i)] = execution_time
for n in range(1000):
  runtimes[n] = sum(runtimes[(n,i)] for i in range(inputs_per_size))/inputs_per_size
现在你可以使用runtimes[n]来构建你的情节。
当然，取决于你的系统是否超级嘈杂，你可能会把(repeats, inputs_per_size)从(4,25)改为(10,10)，甚至是(25,4)。

Question 3


          
           
            我可以用你的代码重现尖峰现象。
           
           
            你应该选择一个合适的计时函数（
            
             time.time()
            
            与
            
             time.clock()
            
            --
            
             from timeit import default_timer
            
            ），一个测试的重复次数（每个测试需要多长时间），以及测试的数量来选择
            
             最低限度
            
            时间从。它给你一个更好的精度，并减少对结果的外部影响。阅读来自的说明
            
             
              timeit.Timer.repeat()
             
            
            docs:
           
           
            从结果向量中计算出平均数和标准差是很诱人的。
  向量并报告这些数据。然而，这并不十分有用。在一个
  在一个典型的例子中，最低值给出了一个下限，即你的机器能以多快的速度
  机器能够运行给定代码片段的速度的下限；结果向量中较高的值通常不是由变化引起的。
  向量中的较高数值通常不是由Python速度的变化引起的，而是由
  是由于其他进程干扰了你的计时精度。所以
            
             的min()
  的结果可能是你应该关注的唯一数字。
            
            之后，你应该看一下整个矢量，并应用常识而不是统计数字。
  常识，而不是统计数据。
           
           
            替换代码4】模块可以为你选择合适的参数。
           
           $ python -mtimeit -s 'from m import testdata, sort; a = testdata[:500]' 'sort(a)'
|------------------------------+-------------------|
| Fitting polynom              | Function          |
|------------------------------+-------------------|
| 1.00  log2(N)   +  1.25e-015 | N                 |
| 2.00  log2(N)   +  5.31e-018 | N*N               |
| 1.19  log2(N)   +      1.116 | N*log2(N)         |
| 1.37  log2(N)   +      2.232 | N*log2(N)*log2(N) |
为了生成这个数字，我使用了make-figures.py:
$ python make-figures.py --nsublists 1 --maxn=0x100000 -s vkazanov.msort -s vkazanov.msort_builtin 
where:
# adapt sorting functions for make-figures.py
def msort(lists):
    assert len(lists) == 1
    return sort(lists[0]) # `sort()` from the question
def msort_builtin(lists):