《数学建模》美赛优秀论文：A Difficulty Metric and Puzzle Generator for Sudoku.pdf_大学文库

Page 1 of 24 Control #2858 Contents 1 Introduction 2 1.1 Notation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2 1.2 Problem Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3 2 Problem Setup 4 2.1 Difficulty Metric . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4 2.2 Puzzle Generation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4 3 A Difficulty Metric 5 3.1 Assumptions and Metric Development . . . . . . . . . . . . . . . . . . . . . 5 3.2 hsolve and Metric Calculation . . . . . . . . . . . . . . . . . . . . . . . . . 6 3.3 Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7 3.3.1 Validation Against Existing Difficulty Ratings . . . . . . . . . . . . . 8 3.3.2 Validation of Difficulty Distribution . . . . . . . . . . . . . . . . . . . 8 3.3.3 Runtime . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10 4 Generator 10 4.1 Standard Generator . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10 4.1.1 Guaranteeing a Unique Solution with Standard Generator . . . . . . 11 4.2 Pseudo-Generator . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11 4.2.1 Differences From Standard Generator . . . . . . . . . . . . . . . . . . 12 4.2.2 Pseudo-Generator Puzzle Variability . . . . . . . . . . . . . . . . . . 12 4.3 Results and Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13 4.3.1 Difficulty Concerns . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13 4.3.2 Generating Puzzles with a Specific Difficulty . . . . . . . . . . . . . . 13 4.3.3 Standard Generator Runtime . . . . . . . . . . . . . . . . . . . . . . 14 4.3.4 Using Pseudo-Generator to Generate Difficult Puzzles . . . . . . . . . 14 5 Conclusion 15 5.1 Strengths . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15 5.2 Weaknesses . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15 5.3 Alternative Approaches and Future Work . . . . . . . . . . . . . . . . . . . . 16 A Sudoku Strategies 17 B Source Code for hsolve 22 1

Page 3 of 24 Control #2858 We will refer to a cell by an ordered pair(i,j), where i is its row and j its column, and the term group will collectively denote a row, column, or region. Given a Sudoku board B, define the Sudoku Solution Graph(SSG)s(B) of b to be th structure that associates to each cell in b the set of the digits that are currently thought to be possible candidates for the cell. For example, in the first Sudoku puzzle presented, cell 9, 9)cannot take the values 1, 3, 4, 7, 8, 9 because it shares a group with cells with these values. Therefore, this cell has values ( 2, 5, 6) in the corresponding SSG To solve a Sudoku board, a player generally applies a number of different strategies, or patterns of logical deduction. These range from fairly obvious to extremely complicated, and a list of Sudoku strategies used in this paper is provided in Appendix A. In this paper, we assume the SSg has been evaluated for every cell on the Sudoku board before applying any strategies 1.2 Problem Background Since its recent rise in popularity, Sudoku has been the focus of increased academic and recreational study. Most of the efforts directed at Sudoku have been directed at solving Sudoku puzzles or analyzing the computational complexity of resolving Sudoku as done 13],4],and[14] via a reduction to an exact cover problem and an application of Knuth's Algorithm X given in [11], and solving the n2 x n2 generalization of Sudoku is known to be NP-complete as a consequence of results given by Yato in 22 o. However, a perhaps deeper and more difficult issue involves rating the difficulty of and nerating Sudoku puzzles. This problem encompasses the following two questi 1. Given a specific Sudoku puzzle, how does one define and determine its difficulty 2. Given a specified difficulty, how does one generate a Sudoku puzzle of this difficulty? While generating a valid Sudoku puzzle is not too complex, the non-local and unclear rocess of deduction makes determining or specifying a difficulty much more complicated Traditional approaches to difficulty rating involve rating a puzzle by the strategies neces- sary to find the solution, while some other approaches have been proposed in 1] and 3. In particular, a genetic algorithms approach taken by Mantere et. al. in [15] found some corre- lation with human-rated difficulties, and Simonis presents some interesting similar findings with a constraint-based rating 17. However, in both cases, the correlation is not completely clear Puzzle generation seems to be an even more difficult issue. Most existing generators use complete search algorithms to systematically add numbers to cells in a grid until a unique solution is found. To generate a puzzle of a given difficulty, this process is repeated until the desired difficulty is achieved. This is the approach found in [15], while [17 posits both this and a similar method based upon removal of cells from a completed board. In 5 Felgenhauer et. al. has enumerated the total possible number of valid Sudoku puzzles In this paper, we present a new approach to rating and generating puzzle create hsolve, a program to simulate the way a human solver approaches a sudoku puzzle in order to present a new solver-based difficulty metric based upon hsolve's simulation of human

Page 3 of 24 Control #2858 We will refer to a cell by an ordered pair (i, j), where i is its row and j its column, and the term group will collectively denote a row, column, or region. Given a Sudoku board B, define the Sudoku Solution Graph (SSG) S(B) of B to be the structure that associates to each cell in B the set of the digits that are currently thought to be possible candidates for the cell. For example, in the first Sudoku puzzle presented, cell (9, 9) cannot take the values {1, 3, 4, 7, 8, 9} because it shares a group with cells with these values. Therefore, this cell has values {2, 5, 6} in the corresponding SSG. To solve a Sudoku board, a player generally applies a number of different strategies, or patterns of logical deduction. These range from fairly obvious to extremely complicated, and a list of Sudoku strategies used in this paper is provided in Appendix A. In this paper, we assume the SSG has been evaluated for every cell on the Sudoku board before applying any strategies. 1.2 Problem Background Since its recent rise in popularity, Sudoku has been the focus of increased academic and recreational study. Most of the efforts directed at Sudoku have been directed at solving Sudoku puzzles or analyzing the computational complexity of resolving Sudoku as done in [13], [4], and [14]. Most notably, Sudoku can now be solved extremely quickly via a reduction to an exact cover problem and an application of Knuth’s Algorithm X given in [11], and solving the n 2 × n 2 generalization of Sudoku is known to be NP-complete as a consequence of results given by Yato in [22]. However, a perhaps deeper and more difficult issue involves rating the difficulty of and generating Sudoku puzzles. This problem encompasses the following two questions: 1. Given a specific Sudoku puzzle, how does one define and determine its difficulty? 2. Given a specified difficulty, how does one generate a Sudoku puzzle of this difficulty? While generating a valid Sudoku puzzle is not too complex, the non-local and unclear process of deduction makes determining or specifying a difficulty much more complicated. Traditional approaches to difficulty rating involve rating a puzzle by the strategies necessary to find the solution, while some other approaches have been proposed in [1] and [3]. In particular, a genetic algorithms approach taken by Mantere et. al. in [15] found some correlation with human-rated difficulties, and Simonis presents some interesting similar findings with a constraint-based rating [17]. However, in both cases, the correlation is not completely clear. Puzzle generation seems to be an even more difficult issue. Most existing generators use complete search algorithms to systematically add numbers to cells in a grid until a unique solution is found. To generate a puzzle of a given difficulty, this process is repeated until the desired difficulty is achieved. This is the approach found in [15], while [17] posits both this and a similar method based upon removal of cells from a completed board. In [5], Felgenhauer et. al. has enumerated the total possible number of valid Sudoku puzzles. In this paper, we present a new approach to rating and generating puzzles. We create hsolve, a program to simulate the way a human solver approaches a Sudoku puzzle in order to present a new solver-based difficulty metric based upon hsolve’s simulation of human 3

Page 4 of 24 Control #2858 solving behavior. We then propose two different methods again based on solve to generate Sudoku puzzles of different difficulties 2 Problem Setup In this paper, we approach the problem in two portions. First, we define a difficulty metric for a Sudoku puzzle and determine an algorithm to compute it. We then create an algorithm to generate Sudoku puzzles with desired levels of difficulty. Before beginning to describe our model, however, we first specify the problem a bit more closely in both cases 2.1 Difficulty metric Our goal in this part of the paper is to create an algorithm that takes a Sudoku puzzle and returns a real number that represents its abstract "difficulty"according to some metric The central issue in determining such a metric is then what we mean by the "difficulty of a Sudoku puzzle. We base our definition of difficulty on the following general assumptions 1. The amount of time it takes for any given solver to solve a puzzle is monotonically increasing with difficulty 2. Every solver tries to solve Sudoku puzzles by applying various strategies The difficulty of a puzzle is an inherently subjective quantity and may vary among different solvers who use different methods of solving Sudoku puzzles. However, this type of subjective consideration is not inherent to the puzzle, so we must restrict ourselves to purely objective reactions. For this purpose, it is a useful intellectual tool to consider some hypothetical typical"solver of some postulated skill level Now, Assumption 1 suggests that we should measure difficulty by the amount of time spent solving the puzzle. Because difficulty is increasing in time spent, any definition of difficulty must give the same ordering as time spent. Since this gives an objectively precise definition, we adopt it; therefore, it only remains to determine more precisely what type of solver our metric should consider The difficulties of puzzles can vary for solvers of different skill levels, and it may be arg that one should measure each of these difficulties separately. However, we are assuming that all solvers' time spent will be consistent with difficulty, and assigning absolute difficulty alues has much practical value. Therefore, in order to avoid the dependence of our results on the novice's ignorance of various techniques and to extend the range of measurable puzzles we take our hypothetical solver to be an expert Hence, we define the difficulty of a Sudoku puzzle to be the average amount of time a hypothetical"typical " Sudoku expert would spend solving it. In the subsequent parts of this paper, our aim will be to create and evaluate a metric to determine this time 2.2 Puzzle generation A second objective of this paper is to develop a method of puzzle generation. Our main goal in puzzle generation is to produce a valid puzzle of a given desired difficulty level that has a

Page 4 of 24 Control #2858 solving behavior. We then propose two different methods again based on hsolve to generate Sudoku puzzles of different difficulties. 2 Problem Setup In this paper, we approach the problem in two portions. First, we define a difficulty metric for a Sudoku puzzle and determine an algorithm to compute it. We then create an algorithm to generate Sudoku puzzles with desired levels of difficulty. Before beginning to describe our model, however, we first specify the problem a bit more closely in both cases. 2.1 Difficulty Metric Our goal in this part of the paper is to create an algorithm that takes a Sudoku puzzle and returns a real number that represents its abstract “difficulty” according to some metric. The central issue in determining such a metric is then what we mean by the “difficulty” of a Sudoku puzzle. We base our definition of difficulty on the following general assumptions: 1. The amount of time it takes for any given solver to solve a puzzle is monotonically increasing with difficulty. 2. Every solver tries to solve Sudoku puzzles by applying various strategies. The difficulty of a puzzle is an inherently subjective quantity and may vary among different solvers who use different methods of solving Sudoku puzzles. However, this type of subjective consideration is not inherent to the puzzle, so we must restrict ourselves to purely objective reactions. For this purpose, it is a useful intellectual tool to consider some hypothetical “typical” solver of some postulated skill level. Now, Assumption 1 suggests that we should measure difficulty by the amount of time spent solving the puzzle. Because difficulty is increasing in time spent, any definition of difficulty must give the same ordering as time spent. Since this gives an objectively precise definition, we adopt it; therefore, it only remains to determine more precisely what type of solver our metric should consider. The difficulties of puzzles can vary for solvers of different skill levels, and it may be argued that one should measure each of these difficulties separately. However, we are assuming that all solvers’ time spent will be consistent with difficulty, and assigning absolute difficulty values has much practical value. Therefore, in order to avoid the dependence of our results on the novice’s ignorance of various techniques and to extend the range of measurable puzzles, we take our hypothetical solver to be an expert. Hence, we define the difficulty of a Sudoku puzzle to be the average amount of time a hypothetical “typical” Sudoku expert would spend solving it. In the subsequent parts of this paper, our aim will be to create and evaluate a metric to determine this time. 2.2 Puzzle Generation A second objective of this paper is to develop a method of puzzle generation. Our main goal in puzzle generation is to produce a valid puzzle of a given desired difficulty level that has a 4

Page 5 of 24 Control #2858 unique solution. As there is no objective scale in difficulty and no fixed criteria that indicate the difference between puzzles of varying difficulties, we will interpret the given numbe of difficulty levels and the desired difficulty level as the division of the set of all Sudoku boards into the given number of difficulty percentile intervals of equal size and the choice of a particular desired percentile interval. We will take a sample of 1000 Sudoku boards and assume that the difficulty distribution of these boards, as measured by our human solver representative of the difficulty distribution of all Sudoku board Our second goal in puzzle generation is to minimize the complexity of the generation algorithm. We note that since the size of the 9 x 9 Sudoku board is fixed. the order of growth of our generation algorithm is irrelevant. Therefore, in this paper, we will interpret the complexity of a generation algorithm to be the expected amount of execution time required to find a Sudoku puzzle of the desired difficulty level 3 A Difficulty metric In this section, we describe and evaluate a difficulty metric for Sudoku puzzles based up a computer solver hsolve that simulates the actions of a human Sudoku expert 3.1 Assumptions and Metric Development As stated in Section 2, we define the difficulty of a puzzle as the average amount of tim expert Sudoku solver to solve it. In orde e this t essentially two possibilities model the process of solving the puzzle 2. Find some heuristic for board configurations that predicts the solve time There are some known heuristics for finding the difficulty of a puzzle; most notably, puzzles with a small number of initial givens are somewhat harder than most. However, according to 91, the overall correlation is weak. Other methods of this type have similar problems, and more importantly, they cannot be used to determine the difficulty of any specific puzzle Therefore, the second possibility can be used at most as a guide for the first, and we must model the actual process of solving the puzzle. We postulate that the following assumptions hold for the solver. 1. The possible strategies can be ranked in order of difficulty, and the solver always applies them from least to most difficult puzzles.We use a widely accepted ranking of strategies described in Appendix oku This is consistent with more or less all references on the subject of solving sudo 2. During the search for a strategy application, each ordering of possible strategy appli- cations occurs with equal probability There are two components of a human search for a possible location to apply a strategy complete search and intuitive pattern recognition. While human pattern recognition

Page 5 of 24 Control #2858 unique solution. As there is no objective scale in difficulty and no fixed criteria that indicate the difference between puzzles of varying difficulties, we will interpret the given number of difficulty levels and the desired difficulty level as the division of the set of all Sudoku boards into the given number of difficulty percentile intervals of equal size and the choice of a particular desired percentile interval. We will take a sample of 1000 Sudoku boards and assume that the difficulty distribution of these boards, as measured by our human solver, is representative of the difficulty distribution of all Sudoku boards. Our second goal in puzzle generation is to minimize the complexity of the generation algorithm. We note that since the size of the 9 × 9 Sudoku board is fixed, the order of growth of our generation algorithm is irrelevant. Therefore, in this paper, we will interpret the complexity of a generation algorithm to be the expected amount of execution time required to find a Sudoku puzzle of the desired difficulty level. 3 A Difficulty Metric In this section, we describe and evaluate a difficulty metric for Sudoku puzzles based upon a computer solver hsolve that simulates the actions of a human Sudoku expert. 3.1 Assumptions and Metric Development As stated in Section 2, we define the difficulty of a puzzle as the average amount of time required for an expert Sudoku solver to solve it. In order to measure this time, there are essentially two possibilities: 1. Model the process of solving the puzzle 2. Find some heuristic for board configurations that predicts the solve time. There are some known heuristics for finding the difficulty of a puzzle; most notably, puzzles with a small number of initial givens are somewhat harder than most. However, according to [9], the overall correlation is weak. Other methods of this type have similar problems, and, more importantly, they cannot be used to determine the difficulty of any specific puzzle. Therefore, the second possibility can be used at most as a guide for the first, and we must model the actual process of solving the puzzle. We postulate that the following assumptions hold for the solver: 1. The possible strategies can be ranked in order of difficulty, and the solver always applies them from least to most difficult. This is consistent with more or less all references on the subject of solving sudoku puzzles. We use a widely accepted ranking of strategies described in Appendix A. 2. During the search for a strategy application, each ordering of possible strategy applications occurs with equal probability. There are two components of a human search for a possible location to apply a strategy: complete search and intuitive pattern recognition. While human pattern recognition 5

Page 6 of 24 Control #2858 is extremely powerful (see, for example 2 ) it is extremely difficult to determine its precise consequences, especially due to possible differences between solvers. Therefore we do not consider any intuitive component to pattern recognition and restrict our model to a complete search for strategy applications. Such a search will proceed between possible applications in the random ordering that we postulate Ts We define a possible application of a strategy to be a configuration on the board that checked by a human to determine if the given strategy can be applied; a list of exactly which configurations are checked varies by strategy and is given in Appendix A. We can now model our solver as following the algorithm Human Solve defined as follows Definition. Given a sudoku puzzle, the algorithm Human Solve repeats the following steps until there are no remaining empty squares 1. Choose the least difficult strategy that has not yet been searched for in the current board configuration 2. Search through possible applications of any of these strategies for a valid application of a strategy 3. Apply the first valid application found We take the difficulty of a single run of Human Solve to be the total number of possible applications that the solver must check: here we assume that each check takes roughly have different difficulties due to different valid s of this method on the same puzzle may now ready to define the difficulty metric of a sudoku board B Definition. For a sudoku board B, take its difficulty metric m(B)to be the average total number of possible applications checked by the solver while using the Human Solve algorithm 3.2 hsolve and metric Calculation In order to actually calculate the difficulty m(B)of a board B, we use hsolve, a pre grain which simulates the running of Human Solve and calculates the resulting difficulty. hsolve is implemented in Java 1.6, and its source code is attached in Appendix B Given a Sudoku puzzle b, hsolve does the following Set the initial difficulty d=0 2. Repeat the following actions in order until B is solved or the solver cannot progress (a)Choose the tier of easiest strategies S that has not yet been searched for in the current board configuration (b)Find the number p of possible applications of S (c) Find the set V of all valid applications of S and compute the size v of V 6

Page 6 of 24 Control #2858 is extremely powerful (see, for example [2]), it is extremely difficult to determine its precise consequences, especially due to possible differences between solvers. Therefore, we do not consider any intuitive component to pattern recognition and restrict our model to a complete search for strategy applications. Such a search will proceed between possible applications in the random ordering that we postulate. We define a possible application of a strategy to be a configuration on the board that is checked by a human to determine if the given strategy can be applied; a list of exactly which configurations are checked varies by strategy and is given in Appendix A. We can now model our solver as following the algorithm HumanSolve defined as follows: Definition. Given a sudoku puzzle, the algorithm HumanSolve repeats the following steps until there are no remaining empty squares: 1. Choose the least difficult strategy that has not yet been searched for in the current board configuration. 2. Search through possible applications of any of these strategies for a valid application of a strategy. 3. Apply the first valid application found. We take the difficulty of a single run of HumanSolve to be the total number of possible applications that the solver must check; here we assume that each check takes roughly the same amount of time. Note that multiple runs of this method on the same puzzle may have different difficulties due to different valid applications being recognized first. We are now ready to define the difficulty metric of a sudoku board B. Definition. For a sudoku board B, take its difficulty metric m(B) to be the average total number of possible applications checked by the solver while using the HumanSolve algorithm. 3.2 hsolve and Metric Calculation In order to actually calculate the difficulty m(B) of a board B, we use hsolve, a program which simulates the running of HumanSolve and calculates the resulting difficulty. hsolve is implemented in Java 1.6, and its source code is attached in Appendix B. Given a Sudoku puzzle B, hsolve does the following: 1. Set the initial difficulty d = 0 2. Repeat the following actions in order until B is solved or the solver cannot progress: (a) Choose the tier of easiest strategies S that has not yet been searched for in the current board configuration. (b) Find the number p of possible applications of S. (c) Find the set V of all valid applications of S and compute the size v of V . 6

Page 7 of 24 Control #2858 (d) Compute E(p, v), the expected number of possible applications that will be examined before a valid application is found. (e) Increment d by E(p, v) · t, where t is the standard check time. Pick a random application in V and apply it to the board. 3. Return the value of d and the final solved board. While hsolve is mostly a direct implementation of HumanSolve, it does not actually perform a random search through possible applications; instead it uses the expected search time E(p, v) to simulate this search. Note that the following lemma gives an extremely convenient closed form expression for E(p, v) that we use in hsolve. Lemma 1. Assuming that all search paths through p possible approaches are equally likely, the expected number E(p, a) of checks required before finding one of v valid approaches is given by E(p, v) = p + 1 v + 1 . Proof. For our purposes, to specify a search path it is enough to specify the v indices of the valid approaches out of p choices, so there are p v possible search paths. Let I be the random variable equal to the smallest index of a valid approach. Then, we have that E(p, v) = p−Xv+1 i=1 iP(I = i) = p−Xv+1 i=1 p−Xv+1 j=i P(I = j) = p−Xv+1 i=1 P(I ≥ i) = 1 p v p−Xv+1 i=1 p + 1 − i v = 1 p v X p−v j=0 v + j v = p+1 v+1 p v = p + 1 v + 1 , where we’ve used the Hockeystick identity. Given a Sudoku puzzle B, then, we calculate m(B) by running hsolve several times and taking the average of the returned difficulties. To increase the level of accuracy, the number of runs can be increased; 20 runs per puzzle was found to give a ratio of standard deviation to mean of about σ µ ≈ 1 10 , so we use this value throughout the rest of the paper. 3.3 Analysis Our evaluation of hsolve consists of three major components: 1. Checking that hsolve’s conception of difficulty is correlated with existing conceptions of difficulty. 2. Comparing the distribution of difficulties generated by hsolve to established distributions for solve time. 3. Finding the runtime of the algorithm. 7

Page 8 of 24 Control #2858 3.3.1 Validation Against Existing Difficulty Ratings For each of the difficulty ratings in isupereasy, veryeasy, easy, medium, hard, harder veryhard, superhard], we downloaded a set of 100 puzzles from [8 to obtain a different difficulty rating to compare with. While this dataset is not a standard difficulty benchmark no other large datasets with varying difficulty ratings were available, and we are looking only for correlation, since there is no objective definition of difficulty We ran hsolve on each puzzle 20 times and recorded the average difficulty for each board We then classified the boards by difficulty into 8 groups of 100 boards based on our difficulty metric. The table of results is shown below Difficulty1|23|4|5678 supereasy 811900 reryeasy19681210000 08383318121 x2=6350(df=49) hard 0210192030118=0.82 harder00572226136|4 veryhard019716132727 superhard0004 2161 Performing a x2-test for independence and computing the Goodman-Kruskal y coefficient 61, we obtain that x=6350 and y=0.82. Note that this corresponds to a p value of less than 0.0001 for the x test, meaning that there is a statistically significant deviation from independence between these two measures of difficulty [7 Furthermore, the Goodman-Kruskal coefficient y=0.82 is relatively close to 1, indicating a somewhat strong correlation between our measure of difficulty and the existing metric This provides some support for the validity of our metric; more precise error analysis seems unnecessary here because we wish only to check that our values are close to those provided by others 3.3.2 Validation of Difficulty Distribution When run 20 times on each of 1000 typical Sudoku puzzles obtained from [12 , hsolve generates the following distribution for measured difficulty. As can be seen in Figure 1, the distribution is sharply peaked near 500 and has a long tail towards higher difficulty 4, We can compare this difficulty distribution plot with the distribution of times required visitorstowww.websudoku.comtosolvethepuzzlesavailablethere21.Thisdistribution is generated by the solution times of millions of users and is shown in the plot in Figure 2 The two distribution graphs both share a peak near 0 and have fat tails in the positive direction. While the tail of our measured difficulties is somewhat fatter, it exhibits the same qualitative behavior as a distribution of solve times generated by millions of real users, again providing validation for our difficulty metric

Page 8 of 24 Control #2858 3.3.1 Validation Against Existing Difficulty Ratings For each of the difficulty ratings in {supereasy,veryeasy,easy,medium,hard,harder, veryhard,superhard}, we downloaded a set of 100 puzzles from [8] to obtain a different difficulty rating to compare with. While this dataset is not a standard difficulty benchmark, no other large datasets with varying difficulty ratings were available, and we are looking only for correlation, since there is no objective definition of difficulty. We ran hsolve on each puzzle 20 times and recorded the average difficulty for each board. We then classified the boards by difficulty into 8 groups of 100 boards based on our difficulty metric. The table of results is shown below: Difficulty 1 2 3 4 5 6 7 8 supereasy 81 19 0 0 0 0 0 0 veryeasy 19 68 12 1 0 0 0 0 easy 0 8 38 33 18 2 1 0 medium 0 2 26 29 22 17 4 0 hard 0 2 10 19 20 30 11 8 harder 0 0 5 7 22 26 36 4 veryhard 0 1 9 7 16 13 27 27 superhard 0 0 0 4 2 12 21 61 χ 2 = 6350 (df = 49) γ = 0.82 Performing a χ 2 -test for independence and computing the Goodman-Kruskal γ coefficient [6], we obtain that χ 2 = 6350 and γ = 0.82. Note that this corresponds to a p value of less than 0.0001 for the χ 2 test, meaning that there is a statistically significant deviation from independence between these two measures of difficulty [7]. Furthermore, the Goodman-Kruskal coefficient γ = 0.82 is relatively close to 1, indicating a somewhat strong correlation between our measure of difficulty and the existing metric. This provides some support for the validity of our metric; more precise error analysis seems unnecessary here because we wish only to check that our values are close to those provided by others. 3.3.2 Validation of Difficulty Distribution When run 20 times on each of 1000 typical Sudoku puzzles obtained from [12], hsolve generates the following distribution for measured difficulty. As can be seen in Figure 1, the distribution is sharply peaked near 500 and has a long tail towards higher difficulty. We can compare this difficulty distribution plot with the distribution of times required for visitors to www.websudoku.com to solve the puzzles available there [21]. This distribution is generated by the solution times of millions of users and is shown in the plot in Figure 2. The two distribution graphs both share a peak near 0 and have fat tails in the positive direction. While the tail of our measured difficulties is somewhat fatter, it exhibits the same qualitative behavior as a distribution of solve times generated by millions of real users, again providing validation for our difficulty metric. 8