Not for the faint hearted

  1. #1
    Join Date
    Dec 1969

    Not for the faint hearted

    I would like a regular expression that returns the 10 most repeated words in a paragraph of text. I also want the regular expression to ignore common words such as 'a', 'the' and 'and'.

    Why do I need this? I'm trying to create meta tag keywords for an article, and I want the most repeated words from the article included in the meta tag to improve search indexing on sites such as Google.

  2. #2
    Join Date
    Dec 1969

    RE: Not for the faint hearted

    I don't think this is a job for a regexp. Consider this:

    1. Insert each word into a database table.
    2. Delete the common words such as 'a', 'the', etc.
    3. Then run a query like this:

    SELECT TOP 10
        COUNT(word) AS hits
      , word
    FROM word_table
    GROUP BY word
    ORDER BY hits DESC
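
For anyone who wants to try the table approach without setting up a server, here is a rough sketch in Python using an in-memory SQLite database. Several things are assumptions and not part of the post above: the function name `top_words_sql`, the `COMMON_WORDS` list, treating a "word" as a run of letters, and lower-casing everything first. SQLite has no `TOP`, so the sketch uses `LIMIT` for the same purpose, and it adds a secondary sort on the word so that ties come back in a predictable order.

```python
import re
import sqlite3

# Hypothetical stop-word list; extend it to suit the article.
COMMON_WORDS = ["a", "the", "and"]


def top_words_sql(text, limit=10):
    """Return (word, hits) rows for the most repeated words in text."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE word_table (word TEXT)")

        # 1. Insert each word into the table (assumption: a word is a run of letters).
        words = re.findall(r"[a-z]+", text.lower())
        conn.executemany(
            "INSERT INTO word_table (word) VALUES (?)",
            [(w,) for w in words],
        )

        # 2. Delete the common words.
        conn.executemany(
            "DELETE FROM word_table WHERE word = ?",
            [(w,) for w in COMMON_WORDS],
        )

        # 3. Count and rank. LIMIT plays the role of TOP in SQLite.
        return conn.execute(
            "SELECT word, COUNT(word) AS hits "
            "FROM word_table "
            "GROUP BY word "
            "ORDER BY hits DESC, word ASC "
            "LIMIT ?",
            (limit,),
        ).fetchall()
    finally:
        conn.close()
```

Calling `top_words_sql(article_text)` gives back a list of `(word, hits)` tuples, most frequent first, which can then be joined into the keywords string.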

  3. #3
    Join Date
    Dec 1969

    Another way...

    Simply strip out all the "junk" in the text, converting all punctuation, new line characters and so on to, say, spaces. (This you could do quite nicely with a regular expression, of course.)

    Now SPLIT the text on the spaces to create an array of words.

    Then sort the array and count any repeated words, keeping track of only the top 10 counts.

    No DB needed, so probably faster.
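
The strip, split, sort and count steps could look something like this in Python. This is only a sketch: the function name `top_words` and the `COMMON_WORDS` set are made up for illustration, the stop-word filter comes from the original question, not from this post, and anything that is not a letter is treated as "junk". For simplicity it counts every distinct word and then keeps the first 10, where the post suggests tracking only the top 10 as you go.

```python
import re

# Hypothetical stop-word set; extend it to suit the article.
COMMON_WORDS = {"a", "the", "and"}


def top_words(text, limit=10):
    """Return (word, count) pairs for the most repeated words in text."""
    # Strip the junk: turn every run of non-letters into a single space.
    cleaned = re.sub(r"[^a-z]+", " ", text.lower())

    # Split on the spaces, drop the common words, then sort so that
    # identical words end up next to each other.
    words = sorted(w for w in cleaned.split() if w not in COMMON_WORDS)

    # Walk the sorted list and count each run of identical words.
    counts = []
    i = 0
    while i < len(words):
        j = i
        while j < len(words) and words[j] == words[i]:
            j += 1
        counts.append((words[i], j - i))
        i = j

    # Highest count first; ties are broken alphabetically.
    counts.sort(key=lambda pair: (-pair[1], pair[0]))
    return counts[:limit]
```

In Python specifically, `collections.Counter(words).most_common(10)` would replace the sort-and-walk loop, but the version above stays closer to the steps as described.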
