Fuzzy String Matching – a survival skill to tackle unstructured information

10 Responses

  1. Michael B says:

    Have you seen this Python package? https://github.com/datamade/dedupe

    It sounds awesome (I am fighting with my Windows laptop to make it work…).
    Basically, a few weeks ago I tried the same strategy as the one you described. The issue is that the number of comparisons grows quadratically: with 30K items in each database you have 900 million comparisons to do (30,000 × 30,000). stringdist works in memory, so you quickly run out of RAM.
    I don't know how long it would take.
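
    A rough estimate (not part of the original comment) of what a full pairwise distance matrix would cost for two lists of 30,000 strings. It assumes a dense matrix of doubles at 8 bytes per cell; the variable names are illustrative only.

    ```r
    # Back-of-the-envelope estimate, assuming 30,000 strings in each source
    # and a dense numeric distance matrix (8 bytes per cell).
    n1 <- 30000
    n2 <- 30000

    comparisons <- n1 * n2        # number of pairwise comparisons
    bytes <- comparisons * 8      # one double per comparison
    gib <- bytes / 1024^3         # convert bytes to GiB

    cat(format(comparisons, big.mark = ","), "comparisons, about",
        round(gib, 1), "GiB\n")
    ```

    Under these assumptions the matrix alone needs several GiB, before counting any intermediate copies.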

    Many papers present an interesting improvement: adding blocking rules. Basically, you select groups of strings that can be compared with each other, instead of comparing everything with everything. A further improvement is a machine learning step that learns the best blocking rules. Dedupe goes even further by learning interactively. I'll let you read the documentation. There is a record linkage package for R for similar tasks, but it is a lot less advanced.
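
    A minimal sketch of the blocking idea in base R, under stated assumptions: the toy device names are made up, and the blocking key (the lower-cased first letter) is a hypothetical choice for illustration. This is not how dedupe learns its rules; it only shows that distances are computed within each block rather than across all pairs.

    ```r
    # Hypothetical toy data; the names are illustrative only.
    source1 <- c("Apple iPhone 6", "Apple iPad Air",
                 "Samsung Galaxy S5", "Sony Xperia Z3")
    source2 <- c("apple iphone6", "samsung galxy s5",
                 "sony xperia z 3", "apple ipad air 2")

    # Blocking key (an assumption): the lower-cased first letter.
    block_key <- function(x) substr(tolower(x), 1, 1)

    keys1 <- block_key(source1)
    keys2 <- block_key(source2)

    # Compare strings only inside blocks that exist in both sources.
    matches <- do.call(rbind, lapply(intersect(keys1, keys2), function(k) {
      a <- source1[keys1 == k]
      b <- source2[keys2 == k]
      d <- adist(a, b, ignore.case = TRUE)   # edit distances within the block
      best <- apply(d, 1, which.min)         # closest candidate per row
      data.frame(s1 = a,
                 s2 = b[best],
                 dist = d[cbind(seq_along(a), best)],
                 stringsAsFactors = FALSE)
    }))

    print(matches)
    ```

    With this key the toy data splits into two blocks of 2 × 2 strings, so 8 distances are computed instead of 16. A key this crude will miss matches whose first letters differ, which is the trade-off blocking rules have to manage.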

    If you are interested in such methods, drop me a message; I am running some experiments on my side right now.

    Kind regards,
    Michael

  2. Forest Gregg says:

    Author of dedupe here.

    We just set up continuous integration for Windows, so hopefully some of these difficulties will be smoothed over. If you are still having issues, please report them at https://github.com/datamade/dedupe/issues

  3. Kristoffer Vitting-Seerup says:

    I have used agrep() with success in some situations.
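
    For readers who have not used it, a small sketch of base R's agrep(), which does approximate matching of a pattern inside each string. The device names and the `max.distance = 2` setting are assumptions chosen for illustration.

    ```r
    # Hypothetical device names; only the calls to agrep() are the point.
    devices <- c("Samsung Galaxy S5", "Samsung Galxy S5",
                 "Apple iPhone 6", "samsung galaxy s 5")

    # Allow up to 2 edits (insertions, deletions, substitutions) and
    # ignore case; value = TRUE returns the matching strings, not indices.
    matches <- agrep("Samsung Galaxy S5", devices,
                     max.distance = 2, ignore.case = TRUE, value = TRUE)

    print(matches)
    ```

    Note that agrep() looks for the pattern as an approximate substring, so longer strings that merely contain something close to the pattern will also match.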

  4. Danny says:

    Great article! While I’m very new to R, I was able to get the first method to work for me. I am matching transactions from two different data sets so I was able to match concatenated values of NAME and AMOUNT with some success.

    I’m finding that about 1/3 of the data is a 100% match (name and dollar amount). Then come the issues: taxes, rounding errors, shipping costs, and multiple purchases from one data set that can be tied to a single entry in the other data set.
    Is there any way to use method 1 to match first on the name, then run again on the amount AFTER the names have been matched up?

    Thanks in advance,
    Danny
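
    One possible way to approach the two-pass idea raised above, sketched in base R rather than with the article's method 1: match on NAME first, then check AMOUNT only for the pairs that were matched. The data, the use of adist(), and the 5% tolerance are all assumptions for illustration.

    ```r
    # Hypothetical transactions; NAME and AMOUNT follow the comment above.
    set1 <- data.frame(NAME = c("ACME Corp", "Globex Inc", "Initech"),
                       AMOUNT = c(100.00, 250.50, 75.00),
                       stringsAsFactors = FALSE)
    set2 <- data.frame(NAME = c("Acme Corp.", "Globex Inc.", "Initech LLC"),
                       AMOUNT = c(100.00, 253.00, 90.00),
                       stringsAsFactors = FALSE)

    # Step 1: fuzzy-match on NAME only (closest edit distance per row).
    d <- adist(set1$NAME, set2$NAME, ignore.case = TRUE)
    best <- apply(d, 1, which.min)
    pairs <- data.frame(name1 = set1$NAME,
                        name2 = set2$NAME[best],
                        amount1 = set1$AMOUNT,
                        amount2 = set2$AMOUNT[best],
                        stringsAsFactors = FALSE)

    # Step 2: compare amounts AFTER the names are paired, allowing a
    # relative tolerance (assumed 5%) for taxes, rounding or shipping.
    tolerance <- 0.05
    pairs$amount_ok <- abs(pairs$amount1 - pairs$amount2) <=
      tolerance * pairs$amount1

    print(pairs)
    ```

    This sketch pairs each row with exactly one candidate, so it does not cover the case where several purchases in one data set correspond to a single entry in the other.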

  5. leerssej says:

    To avoid having to recast your data frame contents from factors back to character strings after the read.csv load, you can just add the `stringsAsFactors = FALSE` argument. For example:
    `source1.devices <- read.csv('source1.csv', stringsAsFactors = FALSE)
    source2.devices <- read.csv('source2.csv', stringsAsFactors = FALSE)`

  1. February 26, 2015

    […] article was first published on Big Data Doctor » R, and kindly contributed to […]

  2. February 27, 2015

    […] How to combine different sources of unstructured information using Fuzzy String Matching: a step-by-step tutorial in R  […]

  3. February 27, 2015

    […] Fuzzy String Matching – a survival skill to tackle unstructured information “The amount of information available on the internet grows every day”. Thank you, Captain Obvious! By now even my grandma is aware of that! Actually, the internet has increasingly become the first address for data people to find good and up-to-date data. But this is not and never has been an easy task. Even if the Semantic Web was pushed very hard in academic environments, and in spite of all the efforts driven by the community of internet visionaries like Sir Tim Berners-Lee, the vast majority of existing sites don’t speak RDF, don’t expose their data with microformats and keep giving a hard time to the people trying to programmatically consume their data. […]

  4. February 28, 2015

    […] “How to combine different sources of unstructured information using Fuzzy String Matching: a step-by-step tutorial in R”  […]

  5. May 1, 2016

    … [Trackback]

    […] Read More Infos here: bigdata-doctor.com/fuzzy-string-matching-survival-skill-tackle-unstructured-information-r/ […]
