Over the past 20 years we have seen many different databases, along with the quality claims made by most data sellers and compilers. None of these claims is backed by a sound way of quantifying quality, so we saw the need for a proper quality scoring mechanism. We at E Media have built expertise in mining data from unstructured sources. As we worked with our customers to identify sources and wrote complex code to compile data from constantly changing web pages, we kept running into the basic problem of measuring quality. To this end we have engineered our own system, the MEDIA STRUCTURED SCORING SYSTEM (MSSS), which scores each record using a formula that considers the weight of every field in the record. The average of all the record scores in a dataset is the MSSS score of that dataset. Our goal is to never let our data score drop from one delivery to the next.

MSSS Method
For every field in the structured dataset, we assign a weight that conveys how important the field is to the value of each record. Weights range from 5 to 100: a weight of 100 marks a field that is essential for the record to be valuable, while 5 marks a good-to-have field. For example, we use the weights listed below for our fields. We may adjust them based on the value our customers place on each field.

Score of a record = 100 * (Sum of weights of all the fields that are not empty) / 1060
Score of a dataset = Average of the scores of all the records
* 1060 is the maximum weight sum, reached when every field of a record is populated.

ID - 100                 PHONE - 50
SIC CATEGORY - 100       FAX - 50
COMPANY NAME - 100       URL - 25
ADDRESS - 100            EMAIL - 50
CITY - 100               CONTACT PERSON - 10
COUNTY - 50              CONTACT TITLE - 10
STATE - 100              ANNUAL SALES VOLUME - 5
ZIP - 100                NUMBER OF EMPLOYEES - 5
LONGITUDE - 50           LATITUDE - 50
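
The formula and weights above can be sketched in a few lines of Python. This is only an illustration, not our production code: the names WEIGHTS, MAX_WEIGHT, is_empty, record_score and dataset_score are made up for the example, records are assumed to be plain dictionaries keyed by field name, and a field is assumed to be empty when it is None or a blank string.

```python
# Field weights as listed in the table above.
WEIGHTS = {
    "ID": 100, "SIC CATEGORY": 100, "COMPANY NAME": 100, "ADDRESS": 100,
    "CITY": 100, "COUNTY": 50, "STATE": 100, "ZIP": 100,
    "LONGITUDE": 50, "LATITUDE": 50, "PHONE": 50, "FAX": 50,
    "URL": 25, "EMAIL": 50, "CONTACT PERSON": 10, "CONTACT TITLE": 10,
    "ANNUAL SALES VOLUME": 5, "NUMBER OF EMPLOYEES": 5,
}

# Denominator exactly as stated in the formula above.
MAX_WEIGHT = 1060


def is_empty(value):
    """Assumption: None and blank strings both count as empty."""
    return value is None or (isinstance(value, str) and not value.strip())


def record_score(record):
    """Return 100 * (sum of weights of non-empty fields) / MAX_WEIGHT."""
    filled = sum(
        weight for field, weight in WEIGHTS.items()
        if not is_empty(record.get(field))
    )
    return 100 * filled / MAX_WEIGHT


def dataset_score(records):
    """Return the average record score, or 0.0 for an empty dataset."""
    if not records:
        return 0.0
    return sum(record_score(r) for r in records) / len(records)
```

For instance, a record that only has ID, COMPANY NAME and PHONE filled in scores 100 * 250 / 1060, or about 23.6. Note that the weights as listed add up to 1055, so with 1060 as the denominator a record with every listed field populated scores about 99.5 in this sketch.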

Basic quality of data:

The MSSS method expects every dataset to follow basic rules of data sanity. It does not treat the number of records as a measure of quality.

a) ID is always unique
b) All blanks are converted into nulls
c) All strings are trimmed
d) Data sanity checks are added for every field
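
Rules a) to c) can be sketched as a small clean-up pass. This is a minimal illustration under assumptions: records are plain dictionaries, nulls are represented by None, and the helper names normalize_record and duplicate_ids are hypothetical.

```python
from collections import Counter


def normalize_record(record):
    """Trim string values and convert blank values to None."""
    cleaned = {}
    for field, value in record.items():
        if isinstance(value, str):
            value = value.strip()
            if value == "":
                value = None
        cleaned[field] = value
    return cleaned


def duplicate_ids(records):
    """Return the ID values that occur more than once, in first-seen order."""
    counts = Counter(
        r.get("ID") for r in records if r.get("ID") is not None
    )
    return [id_ for id_, n in counts.items() if n > 1]
```

A dataset passes rule a) when duplicate_ids returns an empty list after every record has been normalized.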

For example:

a. ZIP code can’t be longer than X characters, where X depends on the country
b. STATE and CITY must exist in that country
c. EMAIL, FAX, and PHONE must have a valid format
d. LATITUDE and LONGITUDE must be valid numbers and fall within the boundaries of that country
e. ANNUAL SALES VOLUME must be within an acceptable range for that currency and country
f. NUMBER OF EMPLOYEES and YEARS IN BUSINESS must be within an acceptable range
g. The combination of COMPANY NAME, ADDRESS, CITY and STATE (compared on the phonetic name, using the SOUNDEX algorithm) must be unique within the dataset
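
A few of these checks can be sketched in Python as follows. Everything here is an assumption made for illustration, not our actual rules: the e-mail pattern is deliberately loose, the country bounding box is supplied by the caller, and the duplicate key applies American Soundex to each word of COMPANY NAME while comparing ADDRESS, CITY and STATE case-insensitively, which is only one possible reading of the uniqueness rule. All function names are hypothetical.

```python
import re
from collections import defaultdict

# Loose check: one "@", no whitespace, and a dot in the domain part.
EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")

# American Soundex digit for each coded consonant.
SOUNDEX_CODES = {}
for digit, letters in {"1": "BFPV", "2": "CGJKQSXZ", "3": "DT",
                       "4": "L", "5": "MN", "6": "R"}.items():
    for ch in letters:
        SOUNDEX_CODES[ch] = digit


def is_valid_email(value):
    return isinstance(value, str) and EMAIL_RE.match(value) is not None


def in_bounds(lat, lon, box):
    """True if lat/lon are numbers inside box.

    box is (min_lat, max_lat, min_lon, max_lon) for the country.
    """
    try:
        lat, lon = float(lat), float(lon)
    except (TypeError, ValueError):
        return False
    min_lat, max_lat, min_lon, max_lon = box
    return min_lat <= lat <= max_lat and min_lon <= lon <= max_lon


def soundex(word):
    """Return the 4-character American Soundex code, or "" if no letters."""
    letters = [c for c in word.upper() if "A" <= c <= "Z"]
    if not letters:
        return ""
    result = letters[0]
    prev = SOUNDEX_CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "HW":
            continue  # H and W do not separate letters with the same code
        code = SOUNDEX_CODES.get(ch, "")
        if code and code != prev:
            result += code
        prev = code  # vowels reset prev, so repeated codes count again
    return (result + "000")[:4]


def dedupe_key(record):
    """Phonetic company name plus normalized ADDRESS, CITY and STATE."""
    words = str(record.get("COMPANY NAME") or "").split()
    name = " ".join(code for code in map(soundex, words) if code)
    rest = tuple(
        str(record.get(f) or "").strip().upper()
        for f in ("ADDRESS", "CITY", "STATE")
    )
    return (name,) + rest


def duplicate_groups(records):
    """Return lists of record indexes that share the same dedupe key."""
    groups = defaultdict(list)
    for index, record in enumerate(records):
        groups[dedupe_key(record)].append(index)
    return [idxs for idxs in groups.values() if len(idxs) > 1]
```

Under this sketch, "Robert Smith Co" and "Rupert Smyth Co" at the same address, city and state fall into the same group, because each pair of words shares a Soundex code.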

