Page 8 of 24 Control #2858

3.3.1 Validation Against Existing Difficulty Ratings

For each of the difficulty ratings in {supereasy, veryeasy, easy, medium, hard, harder, veryhard, superhard}, we downloaded a set of 100 puzzles from [8] to obtain a different difficulty rating to compare with. While this dataset is not a standard difficulty benchmark, no other large datasets with varying difficulty ratings were available, and we are looking only for correlation, since there is no objective definition of difficulty. We ran hsolve on each puzzle 20 times and recorded the average difficulty for each board. We then classified the boards by difficulty into 8 groups of 100 boards based on our difficulty metric. The table of results is shown below:

Difficulty     1    2    3    4    5    6    7    8
supereasy     81   19    0    0    0    0    0    0
veryeasy      19   68   12    1    0    0    0    0
easy           0    8   38   33   18    2    1    0
medium         0    2   26   29   22   17    4    0
hard           0    2   10   19   20   30   11    8
harder         0    0    5    7   22   26   36    4
veryhard       0    1    9    7   16   13   27   27
superhard      0    0    0    4    2   12   21   61

χ² = 6350 (df = 49)    γ = 0.82

Performing a χ²-test for independence and computing the Goodman-Kruskal γ coefficient [6], we obtain that χ² = 6350 and γ = 0.82.
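As a rough illustration of how the two statistics can be computed from a contingency table such as the one above, here is a minimal sketch in plain Python. The function names `chi_square` and `goodman_kruskal_gamma` are our own, not part of hsolve, and the code is not claimed to be the procedure actually used for the reported figures. Applied to the table as transcribed here, the χ² statistic evaluates to about 1450 rather than 6350, so the reported value was presumably obtained from data or a procedure not shown on this page.

```python
# Contingency table as printed in Section 3.3.1.
# Rows: downloaded ratings (supereasy ... superhard); columns: groups 1-8.
TABLE = [
    [81, 19, 0, 0, 0, 0, 0, 0],
    [19, 68, 12, 1, 0, 0, 0, 0],
    [0, 8, 38, 33, 18, 2, 1, 0],
    [0, 2, 26, 29, 22, 17, 4, 0],
    [0, 2, 10, 19, 20, 30, 11, 8],
    [0, 0, 5, 7, 22, 26, 36, 4],
    [0, 1, 9, 7, 16, 13, 27, 27],
    [0, 0, 0, 4, 2, 12, 21, 61],
]


def chi_square(table):
    """Return (statistic, degrees of freedom) for Pearson's independence test."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence of rows and columns.
            expected = row_totals[i] * col_totals[j] / n
            if expected > 0:
                stat += (observed - expected) ** 2 / expected
    df = (len(table) - 1) * (len(table[0]) - 1)
    return stat, df


def goodman_kruskal_gamma(table):
    """Return (C - D) / (C + D) for concordant (C) and discordant (D) pairs."""
    rows, cols = len(table), len(table[0])
    concordant = discordant = 0
    for i in range(rows):
        for j in range(cols):
            # Pair cell (i, j) with every cell in a strictly lower row.
            for k in range(i + 1, rows):
                for l in range(cols):
                    if l > j:
                        concordant += table[i][j] * table[k][l]
                    elif l < j:
                        discordant += table[i][j] * table[k][l]
    return (concordant - discordant) / (concordant + discordant)


if __name__ == "__main__":
    stat, df = chi_square(TABLE)
    print(f"chi-square = {stat:.2f}, df = {df}")
    print(f"gamma = {goodman_kruskal_gamma(TABLE):.2f}")
```

Pairs tied on either variable are ignored by γ, which is why it suits two ordinal rankings with many ties, as here.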
Note that this corresponds to a p-value of less than 0.0001 for the χ² test, meaning that there is a statistically significant deviation from independence between these two measures of difficulty [7]. Furthermore, the Goodman-Kruskal coefficient γ = 0.82 is relatively close to 1, indicating a somewhat strong correlation between our measure of difficulty and the existing metric. This provides some support for the validity of our metric; more precise error analysis seems unnecessary here because we wish only to check that our values are close to those provided by others.

3.3.2 Validation of Difficulty Distribution

When run 20 times on each of 1000 typical Sudoku puzzles obtained from [12], hsolve generates the following distribution for measured difficulty. As can be seen in Figure 1, the distribution is sharply peaked near 500 and has a long tail towards higher difficulty.

We can compare this difficulty distribution plot with the distribution of times required for visitors to www.websudoku.com to solve the puzzles available there [21]. This distribution is generated by the solution times of millions of users and is shown in the plot in Figure 2.

The two distribution graphs both share a peak near 0 and have fat tails in the positive direction. While the tail of our measured difficulties is somewhat fatter, it exhibits the same qualitative behavior as a distribution of solve times generated by millions of real users, again providing validation for our difficulty metric.
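The qualitative description of the difficulty distribution (a sharp peak and a long right tail) can also be summarized numerically. The sketch below is only an illustration under our own assumptions: the helper names are hypothetical, the bin width is arbitrary, and the sample values are made-up toy numbers, not hsolve output. It averages repeated runs per puzzle, locates the most populated histogram bin, and computes the sample skewness, which is positive when the tail extends towards higher values.

```python
from statistics import mean, pstdev


def average_runs(runs_per_puzzle):
    """Average the repeated difficulty measurements of each puzzle."""
    return [mean(runs) for runs in runs_per_puzzle]


def histogram_peak(values, bin_width):
    """Return the left edge of the most populated fixed-width bin."""
    counts = {}
    for v in values:
        edge = int(v // bin_width) * bin_width
        counts[edge] = counts.get(edge, 0) + 1
    return max(counts, key=counts.get)


def skewness(values):
    """Population skewness; positive values indicate a longer right tail."""
    m = mean(values)
    s = pstdev(values)
    return sum(((v - m) / s) ** 3 for v in values) / len(values)


if __name__ == "__main__":
    # Toy data only: two runs per puzzle, chosen to mimic a peak with a tail.
    toy_runs = [[470, 490], [495, 505], [500, 520], [510, 530],
                [520, 540], [880, 920], [1450, 1550]]
    averages = average_runs(toy_runs)
    print("peak bin starts at", histogram_peak(averages, 100))
    print("skewness is positive:", skewness(averages) > 0)
```

The same two summaries could be computed for a sample of user solve times to make the comparison between the two plots less dependent on visual inspection.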