Count unique items in an array

# Thread: Count unique items in an array

1. Senior Member
Join Date
Dec 1969
Posts
11,334

## Count unique items in an array

Ok... I&#039;ve even answered this question before, but now I need some speed.<BR><BR>Currently, to count unique words, I do:<BR><BR>for loop for original array<BR> for loop for new array<BR> set boolean value to false<BR> if outer array element = inner array element then<BR> set inner array element + 1<BR> set boolean value to true<BR> break out of inner for loop<BR> increment a counter to count loops<BR> if boolean value is false then<BR> write word<BR> set count = 1<BR><BR><BR><BR><BR>Works good for small things, because I&#039;ve even suggested this at one point.<BR><BR>However, with the original array hovering around, say, 18,000 elements -- my counter to count the iterations runs around 3 million. (I shrunk this down considerably of 4.9 million by adding one line of code)<BR><BR><BR>Any ideas to speed this sucker up?<BR><BR>

2. Senior Member
Join Date
Dec 1969
Posts
96,118

## Can you get the array sorted???

If it&#039;s sorted, then it&#039;s just one pass through, right?<BR><BR>So since you are basically doing an order n-squared sort, maybe even the time it would take to sort and then count (once) would be worth it, if you could get the sort time down to "n log n", esp. with this many records.<BR><BR>

3. Senior Member
Join Date
Dec 1969
Posts
11,334

## I s'pose

But I don&#039;t understand how that&#039;ll shrink it down... you still have to compare every element in the original array to the new one to see if I should put it in there in the first place... no?<BR><BR>The largest one I have so far is 28,024 elements -- it takes 24.1 million iterations. Sorting that will help it?<BR><BR>BTW.. it&#039;s a vector, but same principle. Originally, (this is for an HTML search -- don&#039;t ask why I&#039;m doing this), if it had a "dirty word" a, and, the, or, etc, I would erase the element of the vector -- it just took much **** time rebuiding the thing over and over. So even when the iterations dropped, the time increased.<BR><BR>So try a sort, eh?

4. Senior Member
Join Date
Dec 1969
Posts
96,118

## You said duplicates...

...in *ONE* array. I guess I didn&#039;t read your code close enough!<BR><BR>I presume you can&#039;t sort *both* arrays, then? If you could, then it would be a simple merge process...still just one time through.<BR><BR>

5. Senior Member
Join Date
Dec 1969
Posts
11,334

## RE: You said duplicates...

Hmmm..<BR><BR>Imagine the original array has every word in an HTML document. So something like<BR><BR>element1 = "hi"<BR>element2 = "there"<BR>element3 = "hi"<BR><BR>The new array would be<BR>"hi" - 2<BR>"there" - 1<BR><BR>Just a &#039;word count&#039;, if you will.<BR><BR>It&#039;s not bad for one page... I just have to do this to 6,000 of &#039;em and it may take awhile :)<BR>

6. Senior Member
Join Date
Dec 1969
Posts
96,118

## Oh...then I was right...

Sort the input array of words.<BR><BR>Sort the "dictionary".<BR><BR>Go through them in parallel:<BR><BR>&#060;%<BR>dcount = 0<BR>priorWord = ""<BR><BR>For hcount = 0 To UBound(htmlWords)<BR> word = htmlWords(hcount)<BR> If word &#060;&#062; priorWord Then<BR> If priorWord &#060;&#062; "" Then<BR> Do While dict(0,dcount) &#060; priorWord<BR> &#039; skip words in dict not in html<BR> dcount = dcount + 1<BR> Loop<BR> If dict(0,dcount) = priorWord Then<BR> dict(1,dcount) = dict(1,dcount) + priorCount<BR> Else<BR> addcount = addcount + 1<BR> adddict(0,addcount) = priorWord<BR> adddict(1,addcount) = priorCount<BR> End If<BR> End If<BR> priorWord = word<BR> priorCount = 0<BR> End If<BR> priorCount = priorCount + 1<BR>Next<BR>%&#062;<BR><BR>Doesn&#039;t show code needed to increase size of addDict array.<BR><BR>After this process, assuming you are doing cumulative totals, you just do a merge of the dict with the adddict arrays.<BR><BR><BR>

7. Senior Member
Join Date
Dec 1969
Posts
11,334

## Ah... I see

So you&#039;re just looking at n and n + 1 word at a time...<BR><BR>I just hope the sorting doesn&#039;t counteract the time... I&#039;ll give &#039;er a shot.<BR><BR>Thanks!

8. Senior Member
Join Date
Dec 1969
Posts
11,334

## Wow

Well, for the one with 17,411 elements, I went from 16 seconds to execute with 3,095,363 iterations.<BR><BR>Sorting the vector (using the &#060;algorithm&#062; class) and just checking the n + 1 word, I&#039;m now at 4 seconds and 1427 iterations for the same 17,411 vector.<BR><BR>Thanks a lot for that idea!

#### Posting Permissions

• You may not post new threads
• You may not post replies
• You may not post attachments
• You may not edit your posts
•